Executive summary

The project (FA8655-12-1-2096) Towards Robust Multiagent Plans, with its extension
Domain-Independent Multiagent Planning: Models, Stability, and Complexity, focused on
theoretical and applied research in the field of multiagent planning in dynamic
environments. The objective of the project was to connect, both formally and empirically,
the developments in domain-specific multiagent planning with the concepts of generic,
domain-independent problem solving, and to propose solutions and techniques robust to the
uncertainty, dynamism and non-determinism of environments modeled closely on the real
world. The project was planned for three years. In Y3, it was prolonged to 43 months,
that is, until the end of 2015.

This final report gives a general overview of the project, summarizes its achievements,
mostly in the form of published research work, and describes the demonstrator.

The research work in the project comprised four accepted journal publications on
advanced techniques for multiagent plan repair, simple regret optimization in online
planning, and oversubscription planning, and one submitted journal article on a novel
type of multi-heuristic search for multiagent planning together with an overview of the
Multiagent Distributed and Local Asynchronous (MADLA) planner. Furthermore, eight papers
were accepted at top artificial intelligence and planning conferences, focused on fault
tolerant planning and an interruptible exploration technique usable in Monte-Carlo tree
search algorithms. The submitted and accepted workshop papers provided good ground for
valuable discussion at the specific research forums and disseminated the work.

In the third year of the project, we organized the first international Competition of
Distributed and Multiagent Planners (CoDMAP), which helped to unify the implementation of
multiagent planners in the multiagent planning research community and to promote
multiagent planning in general.

The project demonstrator, developed in the first and last years of the project, is based
on a simulation system developed in the Tactical AgentFly and Tactical AgentScout projects
(funded previously by US ARMY, CERDEC). The tactical environment used is a high-fidelity
multiagent model which utilizes the distributed multiagent planner and integrates the
designed multiagent online planning algorithm for static and dynamic tactical scenarios.

The report closes with the publications, which provide details on the particular problems
solved during the project.
1 Overview


Achieving joint objectives by teams of cooperative planning agents requires significant
reasoning, but also coordination and communication effort, especially when planning in
dynamic environments. The robustness of the resulting plans with respect to unforeseen
changes is a key property. The research fields tackling such problems are planning under
uncertainty, continuous and online planning, plan repair, and probabilistic planning.
This project focuses on the study of various approaches from these fields; the key
objective is to extend the approaches to the multiagent setting and therefore to the area
of distributed problem solving in general. The research targets comprise:

RT1: Formal modeling. Formalization of multiagent planning problems towards robustness
against uncertainty and dynamism in the environment: mathematical models, specification
of descriptive languages for representing the problems, and formal study of
computational complexity, interaction stability and other properties of robust
multiagent planning and multiagent plans.

RT2: State of the art study. Study prior art on existing multiagent planning approaches,
especially in the fields of planning under uncertainty, continuous and online planning,
plan repair and probabilistic planning, and select or extend the approach best fitting
the problem of robust multiagent planning.

RT3: Algorithm design. Review existing planning and coordination algorithms matching the
multiagent planning problem definition. Extend existing planning methods
(heuristic-based, sampling-based, plan adaptation-based) for robust multiagent planning.
Suggest and develop various multiagent planning methods optimizing the selected criteria
of robust planning.

RT4: Theoretical analysis. Perform theoretical analysis of complexity metrics, such as
computational complexity and communication complexity, of the developed methods, and
study the relationship between robust planning and multiagent planning techniques.

RT5: Empirical analysis. Experimental analysis of the properties of the designed planning
methods, validation of the theoretical results, and analysis of their practical usability
and generality. For this purpose we will develop a set of benchmark problems for
analyzing the performance of the algorithms, and deploy the algorithms into a
high-fidelity simulation. The simulation system, serving as an experimental testbed, will
be based on the Tactical AgentFly and Tactical AgentScout projects (funded previously by
US ARMY, CERDEC). Tactical Environment is a high-fidelity, large-scale multiagent model
of surveillance and tracking missions executed by a fleet of unmanned aerial vehicles.
2 Achievements


The research achievements (RT1, RT2, RT3 and RT4) of the project were reported in the
form of 4 accepted and 1 submitted journal articles, 8 accepted conference papers and
several workshop papers. For the empirical analysis of RT5, we organized a competition of
distributed and multiagent planners and developed a demonstrator utilizing the techniques
described in the respective papers. This section summarizes the publications, gives an
overview of the competition, and describes the demonstrator.


2.1  Journal  Publications 

Simple Regret Optimization in Online Planning for Markov Decision Processes. In online
planning in Markov decision processes (MDPs), the agent focuses on its current state
only, deliberates about the set of possible policies from that state onwards and, when
interrupted, uses the outcome of that exploratory deliberation to choose what action to
perform next. The performance of algorithms for online planning is assessed in terms of
simple regret, which is the agent’s expected performance loss when the chosen action,
rather than an optimal one, is followed. The state-of-the-art algorithms for online
planning in general MDPs were either best effort, or guaranteed only polynomial-rate
reduction of simple regret over time. We introduced a new Monte-Carlo tree search
algorithm, BRUE, that guarantees exponential-rate reduction of simple regret and error
probability. This algorithm is based on a simple yet non-standard state-space sampling
scheme, MCTS2e, in which different parts of each sample are dedicated to different
exploratory objectives. Our empirical evaluation showed that BRUE not only provides
superior performance guarantees, but is also very effective in practice and compares
favorably to state-of-the-art online planning techniques.
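
To make the notion of simple regret concrete, the following minimal Python sketch
estimates it for a stochastic multi-armed bandit (a single-state MDP) under plain uniform
sampling. It is not BRUE or MCTS2e; the Bernoulli reward model, the arm means and all
function names are assumptions chosen purely for illustration.

```python
import random

def recommend_uniform(arm_means, budget, rng):
    """Sample arms round-robin (Bernoulli rewards), then recommend the empirically best arm."""
    n = len(arm_means)
    totals = [0.0] * n
    counts = [0] * n
    for t in range(budget):
        arm = t % n
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        totals[arm] += reward
        counts[arm] += 1
    # Arms that were never sampled get -inf so they are not preferred over sampled ones.
    estimates = [totals[a] / counts[a] if counts[a] else float("-inf") for a in range(n)]
    return max(range(n), key=lambda a: estimates[a])

def simple_regret(arm_means, recommended):
    """Expected loss of following the recommended arm instead of an optimal one."""
    return max(arm_means) - arm_means[recommended]

def expected_simple_regret(arm_means, budget, trials=2000, seed=0):
    """Monte-Carlo estimate of the expected simple regret after `budget` samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += simple_regret(arm_means, recommend_uniform(arm_means, budget, rng))
    return total / trials
```

Calling `expected_simple_regret([0.2, 0.8], budget)` for increasing budgets shows the
regret of the recommendation shrinking as more samples are allowed before interruption.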


Deterministic Oversubscription Planning as Heuristic Search: Abstractions and
Reformulations. The objective of oversubscription planning (OSP) is to achieve as
valuable a subset of goals as possible within a fixed allowance of the total action cost.
Tracing the key sources of progress in classical planning, we identified a severe lack of
effective domain-independent approximations for OSP. With our focus on optimal planning,
two classes of approximation techniques have been found especially useful in the context
of optimal classical planning: those based on state-space abstractions and those based on
logical landmarks for goal reachability. The question we studied was whether some
similar-in-spirit, yet possibly mathematically different, approximation techniques can be
developed for OSP. In the context of abstractions, we defined the notion of additive
abstractions for OSP, studied the complexity of deriving effective abstractions from a
rich space of hypotheses, and revealed some substantial, empirically relevant islands of
tractability. Our empirical evaluation confirmed the effectiveness of the proposed
techniques.
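
The OSP objective itself can be illustrated with a small exhaustive search. The sketch
below is only a brute-force baseline over an explicit state space and does not use the
abstraction or landmark techniques of the article; the tuple-based action encoding and the
function name `solve_osp` are assumptions made for this example.

```python
import math

def solve_osp(init, actions, goal_values, budget):
    """Return (best goal value, plan) reachable with total action cost <= budget.

    actions: list of (name, pre, add, delete, cost) with set-valued pre/add/delete
    and non-negative cost; goal_values: dict mapping goal fact -> value.
    """
    def value(state):
        return sum(v for g, v in goal_values.items() if g in state)

    start = frozenset(init)
    best_value, best_plan = value(start), []
    cheapest = {}                      # state -> lowest cost at which it was expanded
    stack = [(start, 0.0, [])]
    while stack:
        state, spent, plan = stack.pop()
        if cheapest.get(state, math.inf) <= spent:
            continue                   # already expanded at least as cheaply
        cheapest[state] = spent
        if value(state) > best_value:
            best_value, best_plan = value(state), plan
        for name, pre, add, delete, cost in actions:
            if pre <= state and spent + cost <= budget:
                successor = (state - delete) | add
                stack.append((successor, spent + cost, plan + [name]))
    return best_value, best_plan
```

With two goals whose achieving actions together exceed the budget, the search returns the
single more valuable goal, which is exactly the oversubscription trade-off described above.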
Domain-independent Multiagent Plan Repair. Achieving joint objectives in distributed
domain-independent planning problems by teams of cooperative agents requires significant
coordination and communication effort. For systems facing a plan failure in a dynamic
environment, attempts to repair the failed plan arguably do not, in general and
especially in worst-case scenarios, straightforwardly bring any benefit in terms of time
complexity. However, in multiagent settings, communication complexity is of much higher
importance; a high communication overhead might even be prohibitive in certain domains.

We formally introduced the notion of a multiagent plan repair problem. Building upon the
formal treatment, we designed three algorithms for multiagent plan repair that reduce the
problem to specialized instances of the multiagent planning problem. Finally, we
presented an experimental validation, the results of which showed that in decentralized
systems, where frequent coordination is required to achieve joint objectives, attempts to
repair failed multiagent plans lead to lower communication overhead than replanning from
scratch.

Multiagent Plan Repair by Combined Prefix and Suffix Reuse. The plan repair techniques
from the previous article were generalized, and the generalization was extensively
analyzed, both experimentally and theoretically.

Provided that agents act in an uncertain and dynamic environment, their plans can fail.
The straightforward approach to recovering from such situations is to compute a new plan
from scratch, that is, to replan. Even though, in the worst case, plan repair or plan
reuse does not yield an advantage over replanning from scratch, there is sound evidence
from practical use that approaches trying to repair the failed original plan can
outperform replanning in selected problems. One of the possible plan repair techniques is
based on preservation of fragments of the older plans. The article theoretically analyzed
the complexity of plan repair approaches based on preservation of fragments of the
original plan, and experimentally studied three practical aspects affecting their
efficiency in various multiagent settings within a generalized multiagent plan repair
scheme.

We focused on both the computational and the communication efficiency of plan repair in
comparison to replanning from scratch, and we reported on the influence of the following
properties on the efficiency of plan repair: (1) the number of agents involved in the
plan repair process, (2) inter-dependencies among the repaired actions, and (3)
particular modes of reuse of the older plans.

The MADLA Planner: Multiagent Planning by Combination of Distributed and Local Heuristic
Search. We prepared an Artificial Intelligence Journal submission summarizing and
extending the work on the Multiagent Distributed and Local Asynchronous (MADLA) planner.

Multiagent planning for the multiagent STRIPS model requires a process of distributed
plan generation for a cooperative team of agents. Heuristic multiagent state-space search
is an obvious candidate, providing a scheme both for the agents’ local search and for the
inter-agent protocol for distribution of the search. However, distributed heuristic
estimation, application of multi-heuristic search and efficient utilization of the
decentralized computation power were still open challenges. The MADLA planner runs a
distributed variant of a state-space forward-chaining multi-heuristic search with two
versions of the well-known FastForward relaxation heuristic, one estimating the
particular agent’s local subproblem and another estimating the global heuristic values.
We proposed a general asynchronous scheme for such multi-heuristic multiagent searches
and provided proofs of soundness and completeness. The asynchronism allowed efficient
utilization of the agents’ computation power while waiting for responses from other
agents. We also provided a novel distribution scheme for the FastForward heuristic,
inspired by the Set-Additive variation of the FastForward heuristic and with lazy
computation of partial relaxed plans. We experimentally compared the proposed
multi-heuristic scheme with the two used heuristics per se. The results showed that the
proposed solution outperforms search with either one of the heuristics used separately,
positively combining the benefits of both heuristics. In the detailed experimental
analysis, we showed the limits of the planner and of the used heuristics based on
particular properties of the benchmark domains. In a comprehensive set of multiagent
planning domains and problems, we demonstrated that the MADLA Planner outperforms all
state-of-the-art MA-STRIPS multiagent planners.


2.2  Conference  Publications 

Monte-Carlo Tree Search: To MC or to DP? In the context of the BRUE research,
state-of-the-art Monte-Carlo tree search algorithms can be parametrized with either of
two information updating procedures: Monte-Carlo backup (MC-backup) and
dynamic-programming backup (DP-backup). The dynamics of these two procedures are very
different, and so far their relative pros and cons have been poorly understood. By
formally analyzing the dependency of MC- and DP-backups on various parameters of Markov
decision processes, we revealed numerous important issues that had been hidden by the
worst-case bounds on the algorithm performance, and reconfirmed these findings with a
systematic experimental test.


On MABs and Separation of Concerns in Monte-Carlo Planning for MDPs. In the area of
online planning and sampling methods, we extended the proposed algorithm BRUE towards
problems described as Markov decision processes, with their special case of stochastic
multi-armed bandit problems. We analyzed three state-of-the-art Monte-Carlo tree search
algorithms: UCT, BRUE, and MaxUCT. Using the outcome, we (i) introduced two new MCTS
algorithms, MaxBRUE, which combines uniform sampling with Bellman backups, and MpaUCT,
which combines UCB1 with a novel backup procedure, (ii) analyzed them formally and
empirically, and (iii) showed how MCTS algorithms can be further stratified by an
exploration control mechanism that improves their empirical performance without harming
the formal guarantees.


Abstractions for Oversubscription Planning. In deterministic oversubscription planning
(OSP), the objective is to achieve as valuable a subset of goals as possible within a
fixed allowance of the total action cost. Although numerous applications in various
fields share this objective, no substantial algorithmic advances have been made beyond
the very special settings of net-benefit optimization. Tracing the key sources of
progress in classical planning, we identified a severe lack of domain-independent
approximations for OSP, and started by investigating the prospects of abstraction
approximations for this problem. In particular, we defined the notion of additive
abstractions for OSP, studied the complexity of deriving effective abstractions from a
rich space of hypotheses, and revealed substantial, empirically relevant islands of
tractability.

Landmarks in Oversubscription Planning. Another extension of the OSP direction pursued
the exploitation of the concept of landmarks from classical planning. We developed a
framework for exploiting such landmarks in heuristic-search OSP. We showed how standard
landmarks of certain classical planning tasks can be compiled into the OSP task of
interest, resulting in an equivalent OSP task with a lower budget, and thus with a
smaller search space. We then showed how such landmark-based task enrichment can be
combined in a mutually stratifying way with the Best-First-Branch-and-Bound search used
for OSP planning. Our empirical evaluation confirmed the effectiveness of the proposed
landmark-based budget reduction scheme.

Relaxation Heuristics for Multiagent Planning. In multiagent planning, heuristics
significantly improve the efficiency of search-based planners, similarly to classical
planners. Heuristics based on solving a relaxation of the original planning problem have
already been intensively studied in the context of classical planning. In particular, a
popular relaxation is the delete relaxation, where all delete effects of actions are
omitted. We designed and presented a unified view on the distribution of delete
relaxation heuristics for multiagent planning. We experimentally evaluated properties of
the distribution of the additive, max and Fast-Forward relaxation heuristics in the
implementation of early versions of the MADLA planner. The best performing distributed
relaxation heuristics compared favorably to a state-of-the-art multiagent STRIPS planner
in terms of benchmark problem coverage. Finally, we analyzed the impact of limited agent
interactions by means of the recursion depth of the heuristic estimates.
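
For orientation, the additive and max heuristics under the delete relaxation can be
sketched for the classical, single-agent case as follows. This is a centralized textbook
formulation, not the distributed variants studied in the paper; the `Action` type and the
function names are assumptions introduced for the example.

```python
import math
from typing import FrozenSet, NamedTuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]   # ignored below: this is the delete relaxation
    cost: float = 1.0

def relaxed_fact_costs(state, actions, combine):
    """Fixed-point estimate of the cost of reaching each fact, ignoring delete effects."""
    cost = {fact: 0.0 for fact in state}
    changed = True
    while changed:
        changed = False
        for a in actions:
            if all(p in cost for p in a.pre):
                pre_cost = combine(cost[p] for p in a.pre) if a.pre else 0.0
                for fact in a.add:
                    if a.cost + pre_cost < cost.get(fact, math.inf):
                        cost[fact] = a.cost + pre_cost
                        changed = True
    return cost

def h_add(state, goal, actions):
    """Additive heuristic: precondition and goal costs are summed."""
    cost = relaxed_fact_costs(state, actions, sum)
    return sum(cost.get(g, math.inf) for g in goal)

def h_max(state, goal, actions):
    """Max heuristic: only the most expensive precondition or goal counts."""
    cost = relaxed_fact_costs(state, actions, max)
    return max((cost.get(g, math.inf) for g in goal), default=0.0)
```

A goal that is unreachable even in the relaxed problem yields an infinite estimate in both
variants, which a search can use to prune the state.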

Fault Tolerant Planning: Complexity and Compilation. In the context of modeling and
reasoning about agent actions, contingent and classical planning can often be seen as
adopting, respectively, “extreme pessimism” and “extreme optimism” about the action
outcomes. For many everyday scenarios of human reasoning (and thus for many types of
autonomous systems), both these approaches are just too extreme. We examined a planning
model that interpolates between classical and contingent planning via tolerance to
arbitrary n faults occurring during plan execution. We showed that an important fragment
of this fault tolerant planning (FT-planning) exhibits both an appealing solution
structure and appealing worst-case time-complexity properties. We also showed that such
FT-planning tasks can be efficiently compiled into classical planning as long as the
number of possible faults per operator is bounded by a constant, and that this
compilation can be attractive in practice.

On Combinatorial Actions and CMABs with Linear Side Information. Classical online
planning algorithms are typically the tool of choice for dealing with sequential decision
problems in combinatorial search spaces, as in the multiagent setting. Many such
problems, however, also exhibit combinatorial actions, and standard planning algorithms
do not cope well with this type of “the curse of dimensionality.” Based on previous work
on combinatorial multi-armed bandit (CMAB) problems, we proposed a novel CMAB planning
scheme, as well as two specific instances of this scheme, dedicated to exploiting what is
called linear side information. Using a representative strategy game as a benchmark, we
showed that the resulting algorithms compete very favorably with the state of the art.

On Interruptible Pure Exploration in Multi-Armed Bandits. Interruptible pure exploration
in multi-armed bandits (MABs) is a key component of Monte-Carlo tree search algorithms
for sequential decision problems. We introduced Discriminative Bucketing (DB), a novel
family of strategies for pure exploration in MABs, which allows for adapting recent
advances in non-interruptible strategies to the interruptible setting, while guaranteeing
exponential-rate performance improvement over time. Our experimental evaluation
demonstrated that the corresponding instances of DB compete favorably both with the
currently popular strategies UCB1 and ε-Greedy and with conservative uniform sampling.

2.3  Workshop  Publications 

Fast-Forward Heuristic for Multiagent Planning. The use of heuristics in search-based
domain-independent deterministic multiagent planning is as important as in classical
planning. We proposed a formal and an algorithmic adaptation of the well-known
Fast-Forward heuristic to multiagent planning. Such treatment is important as it solves
challenges in the decentralization of this and other heuristics based on relaxation of
the original planning problem. Such decentralization enables global heuristic estimates
to be computed without exposing local information. Additionally, since the Fast-Forward
heuristic is based on relaxed planning, we proposed a multiagent approach for building
factored relaxed planning graphs among the agents.

How to Repair Multiagent Plans: Experimental Approach. Deterministic domain-independent
multiagent planning is an approach to the coordination of cooperative agents with joint
goals. Provided that the agents act in an imperfect environment, such plans can fail. One
possible technique for recovering from such failures is to repair the original plan. One
family of such repair techniques is based on preservation of plan parts. We
experimentally studied three aspects affecting the efficiency of plan repair approaches
based on preservation of fragments of the original plan in a multiagent setting, with a
focus on both the computational and the communication efficiency of plan repair in
comparison to replanning from scratch.

On Robustness of CMAB Algorithms: Experimental Approach. Similarly to the most prominent
approaches for online planning with a polynomial number of possible actions,
state-of-the-art algorithms for online planning with an exponential number of actions are
based on Monte-Carlo sampling. However, without a proper selection of the appropriate
subset of actions, these techniques cannot be used. The most recent algorithms tackling
this problem utilize an assumption of linearity with respect to the combinations of the
actions. In this paper, we experimentally analyzed the robustness of two state-of-the-art
algorithms, NMC and LSI, for online planning with combinatorial actions in various setups
of Real-Time and Turn-Taking Strategy games.

2.4  Competition  of  Distributed  and  Multiagent  Planners  (CoDMAP) 

As a part of the workshop on Distributed and Multiagent Planning (DMAP) at the
International Conference on Automated Planning and Scheduling (ICAPS) 2015, we organized
a competition in distributed and multiagent planning. The main aims of the competition
were to consolidate the planners in terms of input format; to promote development of
multiagent planners both inside and outside of the multiagent research community; and to
provide a proof-of-concept of a potential future multiagent planning track of the
International Planning Competition (IPC). In this section we summarize the course and
highlights of the competition.

Introduction and Aims of CoDMAP. Various forms of multiagent planning have recently found
their way to the automated planning research community; nevertheless, there had been no
competition of multiagent planners in the tradition of the IPC. As the organizers of
DMAP’15, we decided to run a co-located competition of multiagent planners.

We chose an approach similar to that of classical planning competitions: to start with
the smallest possible subset of features and possibly extend them in the future. One of
the main focuses of the competition design was to allow as many existing planners as
possible to enter without large-scale modifications. In order to improve our awareness of
the existing planners and their possible extensions, we conducted a public poll¹. Out of
the poll and other considerations arose three main restrictions of the multiagent
planning model: deterministic, non-durative actions, full observability (with respect to
privacy), cooperative agents and offline planning.

¹ The poll form can be found at: http://bit.ly/lIsNoqY
Figure 1: Comparison of IPC and CoDMAP tracks.


Formalism and Input Language. A crucial point of the competition was to determine a
formalism and an input language. For the formalism, we chose MA-STRIPS [1] for its
simplicity and wide acceptance among existing planners. MA-STRIPS extends the STRIPS
formalism with two concepts: (i) factorization and (ii) privacy. Factorization is defined
in the planning problem and prescribes which STRIPS actions can be executed by which
agents. A STRIPS fact is private if it is not affected by, and cannot affect, more than
one agent.
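
The privacy notion can be sketched in a few lines of Python. The sketch assumes that a
fact "affects" or "is affected by" an agent when it occurs in the preconditions or effects
of one of that agent's actions; this reading, the triple-based action encoding and the
function name `classify_facts` are assumptions of the example, not part of the competition
tooling.

```python
from collections import defaultdict

def classify_facts(agent_actions):
    """Split facts into per-agent private sets and one public set.

    agent_actions: dict mapping agent name -> list of (pre, add, delete) triples,
    each element being a collection of fact names (the factorization).
    """
    users = defaultdict(set)            # fact -> agents whose actions mention it
    for agent, actions in agent_actions.items():
        for pre, add, delete in actions:
            for fact in set(pre) | set(add) | set(delete):
                users[fact].add(agent)

    private = {agent: set() for agent in agent_actions}
    public = set()
    for fact, agents in users.items():
        if len(agents) == 1:
            private[next(iter(agents))].add(fact)   # touched by a single agent only
        else:
            public.add(fact)                        # shared between several agents
    return private, public
```

In a small logistics example where a truck unloads a package at an airport and a plane
loads it there, only the package-at-airport fact is shared; vehicle positions stay private.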

Following the minimalistic extension of STRIPS to MA-STRIPS, we wanted a simple extension
of the common planning language PDDL [7] towards multiagent planning that is also
compatible with MA-STRIPS. After an analysis of candidate languages, we decided to extend
MA-PDDL [6]. The extension² came in two flavors: a factored description, which allowed
the definition of a separate domain and problem description for each agent, and an
unfactored description, which allowed the definition of factorized privacy in a single
domain and problem description. Additionally, our generalized definition of privacy was
sufficient to cover MA-STRIPS privacy, but allowed for more general definitions, possibly
usable in future multiagent planning competitions.

Competition Tracks. The success of a planning competition is determined to a large extent
by the number of contestants, and as there was no historical experience from previous
multiagent planning competitions, we wanted to open the competition to the widest
possible audience. A survey of the literature on multiagent planners, together with the
competition poll, provided enough information to set the rules so that an ample number of
already existing multiagent planners could compete while the key motivations of the
competition remained satisfied.

² The extended BNF can be found at http://agents.fel.cvut.cz/codmap/MA-PDDL-BNF.pdf

The fundamental discriminator of current multiagent planners is whether or not they can
work in a distributed manner on multiple interconnected physical machines. To accommodate
planners running in either mode, the competition was split into two tracks (see Figure 1):

•  Centralized Track, aiming for maximal compatibility with the classical IPC and existing
multiagent planners; the input of a planner was either unfactored or factored MA-PDDL;
the planners ran on a single machine, with no other restrictions or requirements;
communication was not restricted; privacy was not enforced.

•  Distributed Track, much stricter and aiming for a proper multiagent setting; the input
was limited to distributed factored MA-PDDL for each agent; planners ran distributed on a
grid of machines; planners had to communicate over TCP/IP; preservation of privacy of the
local data was required.

Evaluation and Benchmarks. Each run of a planner in the competition was restricted to
30 minutes on 4 computational cores and 8 GB of memory per machine.

The metrics used to compare the planners were coverage (the number of solved problems),
the IPC score over plan quality, and the IPC score over planning time. In the distributed
track, the plan quality was evaluated both in terms of total cost (sum of costs of all
used actions) and makespan (the maximum timestep of the plan if executed in parallel).
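
A rough sketch of how such metrics can be computed is given below. The exact CoDMAP
formulas are not stated in this report; the quality score here follows the form commonly
used in IPC satisficing tracks (reference cost divided by the planner's cost, zero for
unsolved problems), which is an assumption, as are the data layouts and function names.

```python
def coverage(results):
    """results: dict problem -> plan cost, or None if the problem was not solved."""
    return sum(1 for cost in results.values() if cost is not None)

def ipc_quality_score(results, reference_costs):
    """Sum over problems of reference cost / planner cost (0 for unsolved problems).

    Assumes strictly positive plan costs.
    """
    score = 0.0
    for problem, cost in results.items():
        if cost is not None:
            score += reference_costs[problem] / cost
    return score

def total_cost(plan):
    """plan: list of (timestep, agent, action_name, cost) tuples."""
    return sum(cost for _, _, _, cost in plan)

def makespan(plan):
    """Maximum timestep of the parallel plan (timesteps assumed to start at 1)."""
    return max((step for step, _, _, _ in plan), default=0)
```

Two agents acting in the same timestep add to the total cost but not to the makespan, which
is why the distributed track reports both.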

The planners were evaluated over a set of 12 benchmark domains. The domains were
motivated by important and interesting real-world problems and/or by problems exposing
and testing theoretical features of the planners. We used domains from the literature on
multiagent planning: blocksworld, depot, driverlog, elevators08, logistics00, rovers,
satellites, sokoban, woodworking, and zenotravel, each with 20 problem instances of
varying size, number of objects, constants and agents, and thus complexity. Additionally,
we added two novel domains inspired by well-known multiagent problems, not previously
modeled in MA-STRIPS or MA-PDDL: taxi and wireless. The first one was a model of on-demand
transport by taxis in a city, while the other modeled a group of communicating autonomous
nodes in a wireless sensor network.

The validity and quality of plans were evaluated using the VAL tool (footnote 3), which
can handle parallel plans and performs mutex checks.

Selected Results  For the centralized track, we received 12 planners in 17 configurations
prepared by 8 teams; for the distributed track, 6 configurations of 3 planners by
3 teams. Complete, detailed, and interactive results, including a detailed description of the
planners and their authors, can be found on the official competition webpage (footnote 4);
selected results are presented in Table 1.

3 http://www.inf.kcl.ac.uk/research/groups/planning

Coverage

    centralized track              distributed track
    1.-2.  ADP          222        1.  PSM        180
    3.     MAP-LAPKT    216        2.  MAPlan     174
    4.     CMAP         210        3.  MH-FMAP    107

Table 1: Best performing planners in terms of the number of solved problems out of 240
benchmarks overall.

The best-performing planners of the centralized track were two variants of ADP (Agent
Decomposition-based Planner), which is based on the idea of automatic decomposition of
classical planning problems to multiple agents; MAP-LAPKT, which solves an encoded
multiagent problem with a classical planner; and CMAP, which is based on subgoal
extraction and factored compilation to classical planning.

The distributed track was won by a variant of the PSM planner, which is based on the
intersection of finite automata representing sets of the agents' local plans, coined
Planning State Machines (PSM). Second place went to MAPlan, a distributed heuristic
search planner. The multi-heuristic partial-order forward-chaining planner MH-FMAP placed third.

2.5  Demonstrator 

The demonstrator (RT5) is based on the Tactical Environment [5], previously developed
within a series of CERDEC-funded projects, Tactical AgentFly and Tactical AgentScout.
The environment features a high-fidelity simulation with full 3D physics and visualization.
We used this toolkit to create a challenging and dynamic environment in which to test and
demonstrate the developed multiagent (online) planning techniques.

The demonstration consists of a mission domain specification and the application of two
multiagent algorithms for sequential decision making that provide the behavior of the agents
representing the tactical units. The Static Enemy Scenario uses the MADLA planner [9, 10];
in the Dynamic Enemy Scenario, the Linear Side Information (LSI) algorithm [8]
is used for both allied and enemy units.

Scenario  Overview 

The scenario comprises an urban environment (a small village) with three possible
types of units: a convoy (transport unit), infantry (combat units), and guards (defense
units represented by the AFV Stryker). The convoy and guards are wheeled; infantry
move on foot. There is a set of waypoints, which can be connected either by a road or by a
path. The wheeled units can move only on roads, while units moving on foot can use both
roads and paths. Some of the waypoints are also connected by line-of-sight among buildings
(a road or path does not imply line-of-sight). The goal is to move the convoy safely from
an initial waypoint to a goal waypoint.

4 CoDMAP results: http://agents.cz/codmap/results

Figure 2: A 2D and 3D visualization of the simulation environment.

In Figure 2, we show examples of the 2D and 3D visualizations of the simulated
environment (in this particular case with static enemies). In the 2D top view, execution of a
prepared mission plan has just started. The infantry agent is approaching the enemy
infantry from a blind angle in order to disarm it. The convoy and the defending AFV
(denoted as a guard) are prepared to follow the plan once the area has been cleared.
In the 3D screenshot, we show the convoy vehicle, the AFV Stryker with its visualized plan,
and the infantry unit in the distance (green person avatar). The visualization is based on
the original Tactical Environment visualization.

Mission  Domain  Specification 

The mission domain (coined convoy) models the scenario in a classical planning language.
Two simple examples of problems in the convoy domain are shown in Figure 3. The red
locations/connections are roads, the blue locations/connections are paths, and the dotted
lines show line-of-sight. C denotes the convoy, G denotes the goal location, g1 and g2
denote the guards, i1 and i2 denote the infantry, and e1 and e2 denote the enemy infantry
units. The first example comprises three agents (C, i1, g1), the second five agents (C, i1, i2,
g1, g2). In these particular cases, the enemies are static and hence need not be modeled
as agents. As will be shown later, in the scenario with dynamic enemies both sides of
the conflict are modeled equally, so all units are represented by agents with antagonistic
goals.

Figure 3: Two simple problems in the convoy demonstration domain (convoy-a3 on the left
and convoy-a5 on the right).

Figure 4 shows the listing of the domain definition in the Planning Domain Definition
Language (PDDL) [7]. First, the unit types are defined (wunit denotes a wheeled unit),
then the predicates for connections (connected-w is a road, connected-f is a path),
line-of-sight, locations of units, activity of enemies, guarding/watching, and danger.
After the predicates, the definitions of the actions follow. The first three actions are the
move actions of the three distinct unit types. Notice that a convoy unit cannot move to a
waypoint that is not guarded, a guard unit cannot move while it is guarding, and neither a
guard unit nor an infantry unit can move to a waypoint that is endangered by an active enemy.
The next action is move-attack, which moves an infantry unit to the location of an
enemy unit and suppresses the enemy unit at the same time. The last two actions are used by
a guard unit to start guarding (guard) and stop guarding (unguard) a specific waypoint.

A  small  convoy  problem  definition  is  listed  in  Figure  5.  In  the  problem,  there  are  six 
waypoints  and  one  unit  of  each  type.  The  initial  state  corresponds  to  the  left  illustration 
in  Figure  3.  Notice  that  both  wwp3  and  wwp4  are  endangered  by  the  enemy  unit. 

The solution to the convoy-a3 problem is listed in Figure 6(a). According to the solution,
the infantry first moves to suppress the enemy unit. Then the guard moves to a vantage point
(wwp5), and finally the convoy is moved through guarded locations to the goal. A slightly
more complex problem is depicted in the right illustration of Figure 3 (convoy-a5). Its
solution is listed in Figure 6(b). Similarly to the previous solution, the enemies are first
(define (domain convoy)
  (:requirements :strips :adl :typing)

  (:types
    wp wunit unit - object
    convoy guard - wunit
    infantry enemy wunit - unit
  )

  (:predicates
    (connected-f ?wp1 - wp ?wp2 - wp)
    (connected-w ?wp1 - wp ?wp2 - wp)
    (line-of-sight ?wp1 - wp ?wp2 - wp)
    (at ?u - unit ?wp - wp)
    (active ?e - enemy)
    (guarding ?g - guard)
    (danger ?wp - wp ?e - enemy)
    (guarded ?wp - wp ?g - guard)
  )

  (:action move-c
    :parameters (?c - convoy ?from - wp ?to - wp ?g - guard)
    :precondition (and (at ?c ?from) (connected-w ?from ?to) (guarded ?to ?g))
    :effect (and (at ?c ?to) (not (at ?c ?from)))
  )

  (:action move-g
    :parameters (?g - guard ?from - wp ?to - wp)
    :precondition (and (at ?g ?from) (connected-w ?from ?to) (not (guarding ?g))
                       (forall (?e - enemy) (not (and (danger ?to ?e) (active ?e)))))
    :effect (and (at ?g ?to) (not (at ?g ?from)))
  )

  (:action move-i
    :parameters (?i - infantry ?from - wp ?to - wp)
    :precondition (and (at ?i ?from) (connected-f ?from ?to)
                       (forall (?e - enemy) (not (and (danger ?to ?e) (active ?e)))))
    :effect (and (at ?i ?to) (not (at ?i ?from)))
  )

  (:action move-attack
    :parameters (?i - infantry ?from - wp ?to - wp ?e - enemy)
    :precondition (and (at ?i ?from) (at ?e ?to) (connected-f ?from ?to))
    :effect (and (at ?i ?to) (not (at ?i ?from)) (not (active ?e)))
  )

  (:action guard
    :parameters (?g - guard ?from - wp ?to - wp)
    :precondition (and (at ?g ?from) (line-of-sight ?from ?to) (not (guarding ?g))
                       (forall (?e - enemy) (not (and (danger ?to ?e) (active ?e)))))
    :effect (and (guarded ?to ?g) (guarding ?g))
  )

  (:action unguard
    :parameters (?g - guard ?wp - wp)
    :precondition (and (guarding ?g) (guarded ?wp ?g))
    :effect (and (not (guarded ?wp ?g)) (not (guarding ?g)))
  )
)


Figure  4:  Definition  in  PDDL  for  the  convoy  domain. 
(define (problem convoy-1g-1i)
  (:domain convoy)
  (:requirements :adl :strips :typing)

  (:objects
    wwp1 - wp
    wwp3 - wp
    wwp4 - wp
    wwp5 - wp
    wp3 - wp
    wp4 - wp
    c - convoy
    g1 - guard
    i1 - infantry
    e1 - enemy)

  (:init
    (connected-w wwp1 wwp3) (connected-w wwp3 wwp1)
    (connected-w wwp3 wwp4) (connected-w wwp4 wwp3)
    (connected-w wwp4 wwp5) (connected-w wwp5 wwp4)

    (connected-f wwp1 wwp3) (connected-f wwp3 wwp1)
    (connected-f wwp3 wwp4) (connected-f wwp4 wwp3)
    (connected-f wwp4 wwp5) (connected-f wwp5 wwp4)
    (connected-f wwp1 wp4) (connected-f wp4 wwp1)
    (connected-f wp4 wp3) (connected-f wp3 wp4)

    (line-of-sight wp3 wwp3) (line-of-sight wwp3 wp3)
    (line-of-sight wp3 wwp4) (line-of-sight wwp4 wp3)
    (line-of-sight wwp5 wwp4) (line-of-sight wwp4 wwp5)
    (line-of-sight wwp5 wwp3) (line-of-sight wwp3 wwp5)

    (at c wwp1)
    (at g1 wwp1)
    (at i1 wwp1)
    (at e1 wp3)
    (active e1)
    (danger wwp3 e1)
    (danger wwp4 e1))

  (:goal
    (and (at c wwp4)))
)


Figure  5:  Problem  definition  in  PDDL  of  a  convoy  problem. 
0: move-i i1 wwp1 wp4
1: move-attack i1 wp4 wp3 e1
2: move-g g1 wwp1 wwp3
3: move-g g1 wwp3 wwp4
4: move-g g1 wwp4 wwp5
5: guard g1 wwp5 wwp3
6: move-c c wwp1 wwp3 g1
7: unguard g1 wwp3
8: guard g1 wwp5 wwp4
9: move-c c wwp3 wwp4 g1

(a)


0: move-i i1 wwp1 wwp2
1: move-i i1 wwp2 wp1
2: move-g g2 wwp1 wwp2
3: move-i i2 wwp5 wp2
4: move-attack i1 wp1 wp3 e1
5: move-attack i2 wp2 wwp6 e2
6: move-g g2 wwp2 wwp3
7: move-g g2 wwp3 wwp4
8: move-g g1 wwp1 wwp2
9: move-g g1 wwp2 wwp7
10: move-g g1 wwp7 wwp8
11: move-g g2 wwp4 wwp6
12: guard g1 wwp8 wwp2
13: move-c c wwp1 wwp2 g1
14: unguard g1 wwp2
15: guard g1 wwp8 wwp3
16: move-c c wwp2 wwp3 g1
17: guard g2 wwp6 wwp4
18: move-c c wwp3 wwp4 g2
19: unguard g2 wwp4
20: guard g2 wwp6 wwp5
21: move-c c wwp4 wwp5 g2

(b)

Figure  6:  Solution  to  the  convoy-a3  (a)  and  convoy-a5  (b)  problems 


suppressed, then the guards are positioned at two vantage points (wwp6 and wwp8), and
finally the convoy moves safely to the goal position.
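To make the semantics of the listings concrete, the following minimal sketch replays the
plan of Figure 6(a) on the initial state of Figure 5 and checks every precondition of
Figure 4 along the way. It is written in Python purely for illustration (the planner
used in the project is Java-based), and the helper names both, safe, require, apply_step
and run are hypothetical.

```python
def both(pred, pairs):
    """Ground facts for a symmetric relation such as connected-w."""
    return {(pred, a, b) for a, b in pairs} | {(pred, b, a) for a, b in pairs}

def safe(state, wp):
    """The forall precondition: no active enemy endangers waypoint wp."""
    return not any(f[0] == "danger" and f[1] == wp and ("active", f[2]) in state
                   for f in state)

def require(condition, step):
    if not condition:
        raise ValueError(f"precondition violated: {step}")

def apply_step(state, step):
    """Apply one ground action of the convoy domain and return the new state."""
    name, *args = step
    s = set(state)
    if name in ("move-i", "move-g"):
        unit, src, dst = args
        link = "connected-f" if name == "move-i" else "connected-w"
        require(("at", unit, src) in s and (link, src, dst) in s and safe(s, dst), step)
        require(name == "move-i" or ("guarding", unit) not in s, step)
        s.remove(("at", unit, src))
        s.add(("at", unit, dst))
    elif name == "move-attack":
        i, src, dst, e = args
        require(("at", i, src) in s and ("at", e, dst) in s
                and ("connected-f", src, dst) in s, step)
        s.remove(("at", i, src))
        s.add(("at", i, dst))
        s.discard(("active", e))
    elif name == "guard":
        g, src, dst = args
        require(("at", g, src) in s and ("line-of-sight", src, dst) in s
                and ("guarding", g) not in s and safe(s, dst), step)
        s |= {("guarded", dst, g), ("guarding", g)}
    elif name == "unguard":
        g, wp = args
        require(("guarding", g) in s and ("guarded", wp, g) in s, step)
        s -= {("guarded", wp, g), ("guarding", g)}
    elif name == "move-c":
        c, src, dst, g = args
        require(("at", c, src) in s and ("connected-w", src, dst) in s
                and ("guarded", dst, g) in s, step)
        s.remove(("at", c, src))
        s.add(("at", c, dst))
    else:
        raise ValueError(f"unknown action: {name}")
    return s

def run(state, plan):
    for step in plan:
        state = apply_step(state, step)
    return state

roads = [("wwp1", "wwp3"), ("wwp3", "wwp4"), ("wwp4", "wwp5")]
paths = roads + [("wwp1", "wp4"), ("wp4", "wp3")]
sight = [("wp3", "wwp3"), ("wp3", "wwp4"), ("wwp5", "wwp4"), ("wwp5", "wwp3")]
init = (both("connected-w", roads) | both("connected-f", paths)
        | both("line-of-sight", sight)
        | {("at", "c", "wwp1"), ("at", "g1", "wwp1"), ("at", "i1", "wwp1"),
           ("at", "e1", "wp3"), ("active", "e1"),
           ("danger", "wwp3", "e1"), ("danger", "wwp4", "e1")})
plan_a = [("move-i", "i1", "wwp1", "wp4"), ("move-attack", "i1", "wp4", "wp3", "e1"),
          ("move-g", "g1", "wwp1", "wwp3"), ("move-g", "g1", "wwp3", "wwp4"),
          ("move-g", "g1", "wwp4", "wwp5"), ("guard", "g1", "wwp5", "wwp3"),
          ("move-c", "c", "wwp1", "wwp3", "g1"), ("unguard", "g1", "wwp3"),
          ("guard", "g1", "wwp5", "wwp4"), ("move-c", "c", "wwp3", "wwp4", "g1")]

final = run(init, plan_a)
print(("at", "c", "wwp4") in final)   # goal check
```

Reordering the plan so that the guard moves before the enemy is suppressed makes
apply_step raise an error, which mirrors the danger precondition of move-g.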


Tactical Planning  Recall that the tactical simulation of the demonstrator is based on the
Tactical Environment [5]. The state of the environment is represented by sets of state
variable containers called state storages. Each state storage is responsible for holding
a specific part of the current state, i.e., all state storages together constitute the full
description of the current state of the simulated environment. All dynamic objects in
the environment are denoted as simulation entities. A simulation entity is a simulated
embodiment with a related controlling agent. The environment model is accompanied by
a functional part describing the behavior of the simulation. To ensure the properties of the
tested algorithms during the implementation process, the simulation platform facilitates
the construction of experiment suites that allow reproducible experiments to be executed.

The Tactical Environment proposed in [5] was enriched with a new type of unit: a
Domain-Independent Multiagent Planning (DIMAP) Physical Agent. Such a unit represents
any kind of agent that is capable of participating in the multiagent planning
process and has a physical representation that eventually executes the plan. The agent's
architecture is depicted in Figure 7. Following the common agent architecture of the
Tactical Package, the agent incorporates several sensors and actuators in a hierarchical
manner, with the lowest layer being a storage of the world state.

Figure 7: Architecture of an agent for (online) multiagent planning/decision making in
the tactical environment of the demonstrator. The architectures for the static and
dynamic enemy scenarios share the same structure; the only (implementation)
difference is in the naming of the simulation objects (DIMAP Physical Game Agents,
Game Sensors, Game Actuators, etc. for the units in the scenario with dynamic
enemies).

Let us examine the architecture from the bottom up. In the storage layer, there are two
storage instances shared by all agents. The Physical State Storage is similar to the classical
Tactical Package storage and contains the state of the simulation world, i.e., positions and
orientations of the objects. The DIMAP State Storage has been added to store the symbolic
representation of the world used by planning. Since we currently use a Java-based planner,
we were able to reuse the actual data structures: the DIMAP State Storage
contains a DIMAP State, which is then modified by the applied DIMAP actions.

The first sensor/actuator layer comprises two sensor/actuator pairs, each affecting
one of the storage instances. In fact, there is a set of physical sensors and actuators, but
at this level of abstraction we can safely treat them as a single sensor and a single actuator.
The physical actuators perform the actions in the simulation world, be it a simple action
such as turning an enemy towards a waypoint it endangers, or a complex action such
as planning a path with the trajectory planner and executing it. The physical sensor then
reports changes in the world, most notably that a movement has finished. The DIMAP
sensor/actuator pair is much simpler: the actuator applies a given DIMAP Action to the
current DIMAP State, and the sensor reports to all agents that the symbolic world state
has changed.

The  top  sensor/actuator  layer  contains  a  single  sensor  and  actuator,  both  combining  the 
symbolic  and  physical  views  of  the  world.  In  the  case  of  the  sensor,  the  combination  is 
trivial — the  sensor  aggregates  the  DIMAP  sensor  and  all  the  physical  sensors.  The  actuator 
is  slightly  more  intricate — as  we  intend  to  keep  the  architecture  domain-independent,  we 
needed  to  represent  various  actions  present  in  the  current  domain.  This  is  achieved  by 
maintaining  a  collection  of  Physical  Action  instances,  each  representing  a  type  of  DIMAP 
action  (determined  by  a  label,  corresponding  to  the  PDDL  operators),  all  having  a  common 
interface. 

Each Physical Action uses the DIMAP State Actuator to modify the symbolic world
state according to the particular action instance (an operator, or action type, can have
various parameters modifying how the action actually changes the state). In addition,
some actions may have side effects in the physical world, which are carried out by the
physical actuators. In some situations, applying the action's side effects is not
trivial. Take for instance a simple movement action move-c-A-B, which should move the
convoy c from location A to location B. When the physical aspect of the action is applied,
a trajectory is planned (which may take some time) and then the convoy follows
the trajectory in order to reach the destination B (which also takes some time). If we
applied the symbolic effect of the action (something like at-c-B) immediately, the symbolic
representation would describe a world state in which the convoy is already
at its destination, while in the physical state of the world the convoy would still be
planning its trajectory or, at best, be on its way. Therefore we need to apply the symbolic
action effect only when the physical part of the action has finished. To do so, the
Physical Action has access to the DIMAP Physical Sensor and may register callbacks for
events, such as the agent having reached the end of its plan.
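The deferred application of symbolic effects can be illustrated by a small sketch. It is
written in Python for brevity, whereas the demonstrator is Java-based, and the classes
PhysicalSensor and MoveAction, the event name "arrived" and the start_movement hook are
hypothetical stand-ins, not the Tactical Environment API.

```python
class PhysicalSensor:
    """Hypothetical sensor that notifies listeners about physical events."""

    def __init__(self):
        self._callbacks = {}

    def on(self, event, callback):
        """Register a one-shot callback for the given event."""
        self._callbacks.setdefault(event, []).append(callback)

    def emit(self, event):
        """Called by the simulation when the event has happened."""
        for callback in self._callbacks.pop(event, []):
            callback()


class MoveAction:
    """Movement whose symbolic effect is delayed until the unit has arrived."""

    def __init__(self, unit, src, dst):
        self.unit, self.src, self.dst = unit, src, dst

    def execute(self, symbolic_state, sensor, start_movement):
        # The physical side effect starts immediately (trajectory planning, driving).
        start_movement(self.unit, self.dst)

        def apply_symbolic_effect():
            symbolic_state.discard(("at", self.unit, self.src))
            symbolic_state.add(("at", self.unit, self.dst))

        # The symbolic effect is applied only once the sensor reports arrival.
        sensor.on(("arrived", self.unit), apply_symbolic_effect)


state = {("at", "c", "A")}
sensor = PhysicalSensor()
started = []
MoveAction("c", "A", "B").execute(state, sensor, lambda u, d: started.append((u, d)))
print(state)                      # the convoy is still symbolically at A
sensor.emit(("arrived", "c"))
print(state)                      # now the symbolic state says the convoy is at B
```

Keeping the callback inside the action keeps the top-level actuator domain-independent:
it only needs the common execute interface, not the details of each action type.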

At the topmost layer there is the DIMAP Physical Agent itself. The agent incorporates
the decision making algorithm (the MADLA planner and the LSI algorithm in the Static
and Dynamic Enemy Scenarios, respectively), the communication interface, and the DIMAP
Physical sensor/actuator pair. The particular decision making process differs according to
the algorithm used. The details are presented in the following two subsections.

Whenever  the  global  symbolic  DIMAP  state  is  changed,  which  is  detected  by  the  sensors 
in  the  lower  layers,  or  initially  when  the  initial  decision  is  made,  each  agent  checks  its 
decided  action  against  the  current  DIMAP  state.  If  all  preconditions  are  met,  the  action 
is  executed  by  the  actuators  in  the  lower  layers. 
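A minimal sketch of this precondition-triggered execution is given below. It is a
sequential Python simulation of what is an event-driven process in the demonstrator; the
function execute_when_enabled, the fact names and the three one-step plans are
hypothetical.

```python
from collections import deque

def execute_when_enabled(state, plans):
    """Run per-agent action lists; an action fires once its preconditions hold.

    plans maps an agent name to a list of (name, pre, add, delete) tuples,
    where pre, add and delete are sets of symbolic facts.
    """
    queues = {agent: deque(actions) for agent, actions in plans.items()}
    state = set(state)
    order = []
    changed = True                 # the initial decision also triggers a check
    while changed:
        changed = False
        for agent, queue in queues.items():
            # Each agent checks its next decided action against the current state.
            if queue and queue[0][1] <= state:
                name, _pre, add, delete = queue.popleft()
                state = (state - delete) | add
                order.append((agent, name))
                changed = True
    return order, state

plans = {
    "convoy": [("move-c", {"guarded"}, {"convoy-at-goal"}, set())],
    "guard": [("guard", {"enemy-suppressed"}, {"guarded"}, set())],
    "infantry": [("move-attack", set(), {"enemy-suppressed"}, set())],
}
order, final_state = execute_when_enabled(set(), plans)
print(order)
```

In this toy run the convoy and the guard simply wait until the facts they depend on
appear in the shared symbolic state; no explicit messages are exchanged.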




Static  Enemy  Scenario 

In this scenario, there are fixed enemy units endangering some of the waypoints from cover
(we assume the AFV cannot attack them frontally) and an allied infantry squad that is
able to disarm the enemy units by ambushing them (moving to their position from the
direction opposite to the one they watch), thus freeing the passage.

The safety of the convoy is ensured by a guard watching the waypoint to which the convoy
is moving (the convoy cannot move to a waypoint that is not guarded). A single guard
can guard a single waypoint, which must be in line-of-sight from the position of the guard.
While the guard is watching a waypoint, it cannot move. A guard can move only to a
position that is not endangered by an enemy.

We use the initial state of the simulated environment to generate a formal multiagent
planning problem in the form of PDDL files, which is then solved by the distributed
multiagent planner MADLA developed within RT3. The resulting multiagent plan is then
executed in the simulated environment by the transport vehicle, the AFV, and the infantry
squad. The vehicles use a maneuver-based trajectory planner, which is part of the
Tactical Environment package, to plan trajectories between the waypoints. Other actions
(e.g., defense of a waypoint or disarming an enemy) are represented visually.

In order to use the MADLA Planner, a symbolic representation of the simulated world
has to be generated. The symbolic representation used by the MADLA Planner as input
has the form of a Multi-Valued Planning Task (MPT) [3]. An MPT consists of a finite set
of variables, each with an associated finite domain of possible values. A partial state is an
assignment to some of the variables; a state is an assignment to all the variables. The
problem description consists of an initial state, a goal partial state, and a set of actions.
Each action consists of a precondition and an effect. The precondition is an assignment to
some of the variables (in fact a partial state) that must hold in a state in order to
apply the action; the effect is again a partial assignment (state), which describes the new
values of variables after the action is applied.
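The MPT definition above maps directly onto a few lines of code. The following sketch
uses Python dictionaries for states and partial states; the class and function names, as
well as the two-waypoint example, are illustrative assumptions and not the planner's
actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    pre: dict   # partial state that must hold: variable -> required value
    eff: dict   # partial state after application: variable -> new value

def holds(partial, state):
    """A partial state holds in a state if all of its assignments agree with it."""
    return all(state[var] == val for var, val in partial.items())

def apply_action(action, state):
    """Return the successor state, or raise if the action is not applicable."""
    if not holds(action.pre, state):
        raise ValueError(f"{action.name} is not applicable")
    successor = dict(state)
    successor.update(action.eff)
    return successor

# Tiny example with two waypoints A and B; None stands for "no waypoint guarded".
variables = {"c-at": {"A", "B"}, "guarded-by-g": {"A", "B", None}}
init = {"c-at": "A", "guarded-by-g": None}
goal = {"c-at": "B"}
guard = Action("guard-g-A-B", {"guarded-by-g": None}, {"guarded-by-g": "B"})
move = Action("move-c-A-B-g", {"c-at": "A", "guarded-by-g": "B"}, {"c-at": "B"})

state = apply_action(move, apply_action(guard, init))
print(holds(goal, state))
```

Because preconditions, effects, and goals are all partial states, a single holds check
serves both applicability tests and goal tests.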

As the problem at hand is multiagent, the problem description is factored for the separate
agents. The factorization follows the MA-STRIPS [2] definition: a variable-value pair is
public if it is shared among actions of two (or more) distinct agents; otherwise it is private.
An action is public if it uses some of the public variable-value pairs. A public action can
use both public and private variable-value pairs; such a public action is seen by other agents
as a projection, i.e., stripped of all private preconditions and effects.
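The public/private classification can be computed mechanically from the agents' action
sets. The sketch below is a minimal Python illustration under the definition just given;
the function names and the simplified convoy and guard actions are hypothetical.

```python
def facts_of(action):
    """All variable-value pairs used in the precondition or effect of an action."""
    _name, pre, eff = action
    return set(pre.items()) | set(eff.items())

def public_facts(actions_by_agent):
    """A variable-value pair is public if actions of at least two agents use it."""
    users = {}
    for agent, actions in actions_by_agent.items():
        for action in actions:
            for fact in facts_of(action):
                users.setdefault(fact, set()).add(agent)
    return {fact for fact, agents in users.items() if len(agents) >= 2}

def is_public(action, public):
    return bool(facts_of(action) & public)

def project(action, public):
    """The view other agents have of a public action: private parts are removed."""
    name, pre, eff = action
    return (name,
            {var: val for var, val in pre.items() if (var, val) in public},
            {var: val for var, val in eff.items() if (var, val) in public})

move_c = ("move-c-A-B-g", {"c-at": "A", "guarded-by-g": "B"}, {"c-at": "B"})
guard_g = ("guard-g-X-B", {"g-at": "X", "guarded-by-g": None}, {"guarded-by-g": "B"})
move_g = ("move-g-X-Y", {"g-at": "X"}, {"g-at": "Y"})
actions = {"convoy": [move_c], "guard": [guard_g, move_g]}

public = public_facts(actions)
print(public)
print(project(move_c, public))
```

In this small example only the pair (guarded-by-g, B) is shared, so the convoy's position
stays private and is stripped from the projection that the guard sees.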

Let  W  be  the  set  of  all  waypoints  in  the  map  (currently  all  waypoints  are  reachable  by 
all  agents).  Each  of  the  agents  is  represented  by  the  following  variables,  values  and  actions: 

Convoy  The position of a convoy c is represented by a single variable c-at whose values
are all waypoints in the map, i.e., c-at ∈ W. The target location of the convoy is part of
the goal state. For each convoy, each guard, and each pair of waypoints, there is an
action move-c-from-to-g, which represents the movement of the convoy between the two
locations while the to waypoint is guarded by the guard g. The preconditions of the action
are that the convoy is at waypoint from and that the waypoint to is guarded by guard g
(guarded-by-g variable). The effect is that the convoy is at to and that the guarded
waypoint has been reached (see Guard). The move action is public because it shares
the guarded-by-g variable with the guard's guard/unguard actions.

Guard  Similarly to the convoy, the position of a guard g is represented by a variable
g-at ∈ W. In addition, there is a variable guarded-by-g ∈ W ∪ {⊥} stating which waypoint is
currently guarded by the guard g (⊥ represents that no waypoint is guarded), and
a variable guarded-wp-g ∈ {REACHED, UNREACHED} stating whether the currently
guarded waypoint has already been reached. This is necessary in order
to synchronize the actions. A guard has a simple move-g-from-to action, which is
private, and a move^d-g-from-to action, which is used when the to
waypoint is endangered by some enemy (see Enemy) and has the additional
precondition that such an enemy must be disarmed. This action is public because it
interacts with the infantry actions. Further public actions are the guard-g-from-to
actions. The guard actions are generated for each pair of waypoints within a predefined
range and with direct line-of-sight visibility between them, based on the current
environment model. The action guard-g-from-to has precondition {g-at = from,
guarded-by-g = ⊥} and effect {guarded-by-g = to, guarded-wp-g = UNREACHED}. In addition,
there is again a variant of the action, named guard^d, which is used for the endangered
waypoints and has an additional precondition similar to that of the move^d action. Finally,
there is one private unguard-g action per guard, which, if guarded-wp-g = REACHED,
changes the value of guarded-by-g to ⊥.

Infantry  As usual, an infantry unit i has one variable i-at ∈ W describing its current
location, and three actions. The first is a simple private move-i-from-to, which changes the
value of i-at; the second is the public move^d variant with the additional precondition
{to-danger-by-e = F} for all enemies e endangering the waypoint to. The last action is
specific to the infantry agents: a moveattack-i-from-to-e action, which is generated for each
waypoint to occupied by some enemy and each waypoint from connected to it. The
action preconditions are straightforward: the infantry must be at from and the enemy must
be at to. The effect of the moveattack action is that all variables wp-danger-by-e are set
to F and that the infantry moves to the waypoint to. This action is public.

Enemy  While an enemy is not an agent per se, there is a variable wp-danger-by-e ∈ {T, F}
associated with a waypoint wp and an enemy e if the waypoint is endangered by the
enemy (or the enemy is at the waypoint). This variable is initially set to true (T)
and is public, because it is required in the preconditions of guard actions and is changed
by infantry actions.

The decision making process in the scenario with static enemies uses off-line planning.
First, the agents invoke the MADLA planner in order to synthesize a plan in cooperation
with the other agents. When the goal state is reached by any of the agents, the plan
reconstruction phase is initiated. In this phase, the agents reconstruct the plan backwards
from the goal state, starting from the agent that reached the goal. Since multiple
agents may initiate the reconstruction process simultaneously, the agent's id and the plan's
cost are included in the reconstruction messages. The least-cost plan started by the agent
with the lowest id is kept; the other plans are discarded and their reconstruction is halted
when the redundancy is detected. Each agent stores the parts of the plan that it has
reconstructed itself.
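One way to read the selection rule is as a lexicographic comparison: prefer the lower plan
cost and break ties by the lower id of the initiating agent. The sketch below shows this
reading in Python; the message fields and function names are hypothetical, and the actual
MADLA message format is not described in this report.

```python
def is_better(candidate, current):
    """Assumed rule: lower cost wins, ties are broken by the lower initiator id."""
    if current is None:
        return True
    return ((candidate["cost"], candidate["initiator"])
            < (current["cost"], current["initiator"]))

def select_reconstruction(messages):
    """Return the reconstruction to keep and those whose processing is halted."""
    kept = None
    halted = []
    for message in messages:
        if is_better(message, kept):
            if kept is not None:
                halted.append(kept)
            kept = message
        else:
            halted.append(message)
    return kept, halted

messages = [
    {"initiator": 3, "cost": 10},
    {"initiator": 5, "cost": 8},
    {"initiator": 2, "cost": 8},
]
kept, halted = select_reconstruction(messages)
print(kept)
```

Because every agent applies the same deterministic comparison, all agents end up keeping
the same reconstruction regardless of the order in which the messages arrive.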

Once the plan is reconstructed, each agent has a list of actions to be executed. Since the
plan was constructed using a DIMAP planner, this implicit synchronization is enough to
execute the plan successfully in a (deterministic) environment; no further communication
is needed. This approach also allows actions to be parallelized naturally where possible.

Dynamic  Enemy  Scenario 

In contrast to the previous scenario, the enemy units in this case are not fixed and try
to fulfill their own goals. The types of units are the same for both sides of the conflict;
however, their numbers in the field can differ. The goal of each side is to get its convoy
to its goal position. The initial position of one side's convoy is the goal position
of the enemy convoy, and vice versa. The winning side is the one that gets its
convoy to its goal position first. If neither side can get its convoy to the goal position,
the result is a draw.

In the Dynamic Enemy Scenario, each unit (regardless of its type) can be incapacitated by
an attack from an infantry unit. The mechanics of the attack differ slightly from the previous
scenario: the infantry attacks from its position and does not move during the attack. An
incapacitated unit cannot act anymore (move, attack, guard, etc.). If two infantry units attack
each other in one time step, both are incapacitated. If a convoy is incapacitated, its side
cannot win by definition of the goal, but it can still act such that the conflict ends in a draw.
As the infantry attacks directly, there are no explicitly endangered waypoints as in the
scenario with static enemies. Implicitly, all waypoints the infantry can move to from its
position are endangered.

A convoy no longer requires that the waypoint it is moving to be guarded. However, an
enemy unit cannot move to a guarded waypoint. That means a guard can prevent an infantry
unit from moving into the vicinity of the convoy and thereby rule out an attack on the convoy.
The principle that a single guard can guard a single waypoint, which must be in line-of-sight
from its position, is kept.

Additionally, a wheeled unit cannot move to a waypoint already occupied by another
wheeled unit. This principle held implicitly in the scenario with static enemies, as
the convoy could move only to a guarded waypoint and the AFV could not guard the
waypoint it was standing at. In this scenario, we use a variable empty-wp ∈ {⊥, ⊤}
indicating whether the waypoint wp is empty of wheeled units.

The uncertainty caused by the need to counteract the adversaries required deployment of an
online planner we developed in the context of RT3, namely the LSI (Linear Side Information)
algorithm, which utilizes Monte-Carlo sampling. Since LSI was designed with multiagent
settings in mind, it is suitable for the simulated mission with more units on both sides
of the conflict. The algorithm is, however, exponential in the lookahead depth, which is
proportional to the scale of the problem. As the number of possible combinations of
actions applicable in each step grows exponentially with the number of agents, the
algorithm has to use an additional assumption to provide decisions for all the agents in
tractable time. LSI achieves this by assuming that the actions of particular agents can be
combined without any influence on each other. Such an assumption works well when the units
are separated or work with different resources. When the units have to cooperate or to
resolve conflicts in their behavior, the assumption can be rather misleading; it is
nevertheless the state-of-the-art approach.
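The growth argument above can be illustrated with a small counting sketch. It is not part
of LSI; the function names are hypothetical, and the sketch only compares how many joint
action combinations exist with how many single-agent actions would have to be assessed if
agents could be treated independently, as the stated assumption permits.

```python
from math import prod


def joint_action_count(actions_per_agent):
    """Number of joint actions when every combination of agent actions is considered."""
    return prod(actions_per_agent)


def independent_action_count(actions_per_agent):
    """Number of single-agent actions to assess when agents are treated independently."""
    return sum(actions_per_agent)


def compare(num_agents, actions_each):
    """Return (joint, independent) counts for identical agents (illustrative only)."""
    counts = [actions_each] * num_agents
    return joint_action_count(counts), independent_action_count(counts)
```

For instance, `compare(4, 5)` contrasts the product 5^4 with the sum 4 * 5, which is the gap
the independence assumption exploits.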

LSI is a selection algorithm for combinations of actions of the agents. It minimizes an
estimate of the simple (simulated) regret incurred by executing the combination and then
letting both sides act randomly until one of the final states is reached (Monte Carlo
sampling). In practice, that means the algorithm selects actions for all agents in the
current state of the environment such that the simulated reward after a limited number of
samples is maximal. In a game-like scenario such as the one described here, the reward of
the final states can be simply 1 for a win, 0 for a tie, and -1 for a loss. However, always
simulating all samples to a final state increases the complexity, and if the game contains
cycles, the simulation may never terminate. Therefore, LSI uses a heuristic evaluation
function which, after a predefined number of simulation steps, evaluates the reached state
and uses the evaluation as the reward. In our particular case, the heuristic evaluates the
distance of the convoy to the goal waypoint (interval 0-10000) and sums it with predefined
values of the active units (100 for an infantry, 1000 for a guard, and 10000 for a convoy).
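A minimal sketch of such a heuristic evaluation is given below. Only the interval of the
distance term and the three unit values come from the description above; the linear
normalization, the choice that a shorter distance scores higher, and all names (`Unit`,
`distance_score`, `evaluate`) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Unit values as stated in the text.
UNIT_VALUE = {"infantry": 100, "guard": 1000, "convoy": 10000}
MAX_DISTANCE_SCORE = 10000  # upper end of the stated 0-10000 interval


@dataclass
class Unit:
    kind: str           # "infantry", "guard" or "convoy"
    alive: bool = True


def distance_score(distance, max_distance):
    """Map the convoy-to-goal distance to [0, 10000].

    Assumption: the score is linear and a convoy closer to the goal scores higher.
    """
    if max_distance <= 0:
        return float(MAX_DISTANCE_SCORE)
    clipped = min(max(distance, 0.0), max_distance)
    return MAX_DISTANCE_SCORE * (1.0 - clipped / max_distance)


def evaluate(own_units, convoy_distance, max_distance):
    """Heuristic reward of a non-final state: distance term plus values of active units."""
    unit_term = sum(UNIT_VALUE[u.kind] for u in own_units if u.alive)
    return distance_score(convoy_distance, max_distance) + unit_term
```

With these weights, losing the convoy outweighs any gain in distance, which matches the
ordering of the unit values in the text.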

As the decisions of the two sides of the conflict are not coordinated (the agents behave
competitively), actions can get into conflicts. Although the action combination selected
by one side has to be non-conflicting, conflicts between the enemies have to be resolved
during execution. At each step of the DIMAP State model, conflicting actions are replaced
by empty actions, so the conflict cannot occur during the following execution. Two actions
are in conflict if their effects change the same variable in the MPT description. For
example, if two wheeled units decide to move to the same waypoint, the empty-wp variable
would be set by both units to empty-wp = ⊥. Such behavior is detected as a conflict and
both move actions are replaced by empty actions, i.e., the units will not move in the
particular step.
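The conflict rule above can be sketched as follows. This is an illustrative reconstruction
and not the demonstrator's code: the `Action` type, the `NOOP` constant and the
representation of effects as (variable, value) pairs are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    effects: tuple = ()  # tuple of (variable, value) pairs written by the action


NOOP = Action("noop")  # the empty action: no effects


def resolve_conflicts(selected):
    """Replace every action that writes a variable also written by another agent.

    selected: dict mapping agent name -> Action chosen for this step.
    Returns a new dict in which all conflicting actions are replaced by NOOP.
    """
    writers = {}
    for agent, action in selected.items():
        for variable, _value in action.effects:
            writers.setdefault(variable, set()).add(agent)
    conflicted = {a for agents in writers.values() if len(agents) > 1 for a in agents}
    return {agent: (NOOP if agent in conflicted else action)
            for agent, action in selected.items()}
```

Two convoys moving to the same waypoint both write the corresponding empty-wp variable,
so both are replaced by the empty action, while an unrelated infantry move is kept.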

The process of Monte Carlo sampling uses the STRIPS applicability of actions described
in the convoy domain; the action model is therefore the same as in the static case, only
the decision making technique differs. To increase the efficiency of picking a random
applicable action during the simulation, we have implemented an action cache using the
position of a unit as the key. To use the convoy domain for a conflict scenario, we had to
enrich and adjust the models of the unit types:

Convoy  Each convoy c can be alive-c ∈ {⊥, ⊤}. The position of a convoy c is represented
by a single variable c-at whose values are all waypoints in the map, i.e., c-at ∈ W.
The target location of the convoy is part of the goal state. Each team has its own set of
goals. For each convoy and each two waypoints, there is an action move-c-from-to
which represents the movement of the convoy between the two locations. The preconditions
of the action are that the convoy is alive (alive-c = ⊤), that it is at waypoint from,
that the waypoint to is empty of wheeled units (empty-to = ⊤), and that the waypoint to is
not guarded by an enemy guard (see guarded-wp in the next paragraph). The effect is that
the convoy is at to and that the waypoint to is not empty any more (empty-to = ⊥).

Guard  Similarly to the convoy, a guard g can be alive-g ∈ {⊥, ⊤} and its position is
represented by a variable g-at ∈ W. In addition, there is a variable guarding-g ∈ W ∪ {⊥}
stating which waypoint is currently guarded by the guard g (⊥ represents that
no waypoint is guarded). Moreover, each waypoint wp is marked by guarded-wp
as unguarded (⊥,⊥), guarded by one side (⊤,⊥) or (⊥,⊤), or guarded by both
sides of the conflict (⊤,⊤). The information about which AFV guards the waypoint is
already in guarding-g, therefore it is not kept in the guarded-wp variable. A guard
has a move-g-from-to action which behaves similarly to the move action of a convoy.
Further actions are the guard-g-from-to actions. The guard actions are generated for
each two waypoints within a predefined range and with direct line-of-sight visibility
between them, based on the current environment model. The action guard-g-from-to
has precondition {alive-g = ⊤, g-at = from, guarding-g = ⊥, ...} and effect
{guarding-g = to, ...}. The actions come in two variants, for guarded-to = (⊥,*) and
guarded-to = (⊤,*). There are also two variants of the reverse action unguard-g-from-to,
which sets guarding-g = ⊥ and reverts guarded-to. Additionally, guards can move only if
guarding-g = ⊥.

Infantry  As usual, an infantry i can be alive-i ∈ {⊥, ⊤} and has one variable i-at ∈ W
describing its current location. Infantry can move by the action move-i-from-to, which
changes the value of i-at; in contrast to the convoy and the guards, empty-to = ⊤ is not
required. The attack actions attack-i-from-to-e are generated for all combinations of
waypoints from and to and enemies e. The infantry can attack only if it is at from and
alive, and the enemy unit is at to and also alive. The effect of an attack action is
alive-e = ⊥.
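As an illustration of how such a grounded action can be checked and applied, the sketch
below encodes the convoy action move-c-from-to over a dictionary-based state. The
dictionary representation, the use of Python booleans for ⊤/⊥, the pair encoding of
guarded-wp, the `enemy` index and the function name `move_convoy` are all assumptions; only
the listed preconditions and effects are taken from the description above.

```python
def move_convoy(c, frm, to, enemy):
    """Build (precondition, effect) functions for the grounded action move-c-frm-to.

    enemy is the index (0 or 1) of the opposing side inside the guarded-wp pair.
    """

    def precondition(state):
        return (
            state[f"alive-{c}"]                   # alive-c = T
            and state[f"{c}-at"] == frm           # convoy is at waypoint `frm`
            and state[f"empty-{to}"]              # no wheeled unit at `to`
            and not state[f"guarded-{to}"][enemy] # `to` is not guarded by an enemy
        )

    def effect(state):
        # Only the two effects named in the text are applied; freeing the source
        # waypoint is not described there and is therefore left out of this sketch.
        new_state = dict(state)
        new_state[f"{c}-at"] = to
        new_state[f"empty-{to}"] = False
        return new_state

    return precondition, effect
```

The effect function returns a copy of the state, so the same state can be reused when
sampling several candidate actions.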

The decision making process in the scenario with dynamic enemies uses online planning.
The agents on both sides of the conflict invoke their instance of the LSI algorithm in
order to get the actions they should execute. When the actions are selected by LSI, the
conflicts are detected pairwise and resolved before execution by the principle described
in the previous paragraphs.

With the conflict-free actions, the agents run the lower levels of the DIMAP architecture
using the actions selected by LSI. After the low-level execution of the actions and an
update of the DIMAP State Storage, LSI is run again with the new DIMAP state and new
actions are selected for the next step and execution. This process is repeated until one
of the final states is reached.
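The select, resolve, execute and repeat cycle described above can be summarized by the
following skeleton. It is a hedged sketch only: the callables `planners`, `resolve`,
`execute` and `is_final` are hypothetical placeholders for the LSI instances, the conflict
resolution, the low-level execution with state update, and the final-state test.

```python
def run_online(state, planners, resolve, execute, is_final, max_steps=1000):
    """Repeat: select actions per side, resolve conflicts, execute, update the state.

    planners: list of callables, one per side, each mapping state -> {agent: action}.
    Returns the last state and the number of executed steps.
    """
    for step in range(max_steps):
        if is_final(state):
            return state, step
        selected = {}
        for plan in planners:          # each side plans on its own, uncoordinated
            selected.update(plan(state))
        state = execute(state, resolve(selected))
    return state, max_steps
```

The `max_steps` bound is an addition of this sketch to guarantee termination; the report
itself only states that the loop runs until a final state is reached.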

Figure 8: Selected initial states in the scenario with static enemies; small setting (left),
large  setting  (right). 


Discussion  and  Conclusions 

Although the MADLA planner and the online planning algorithm LSI were analyzed on
commonly used benchmarks and the results were presented in the respective scientific
articles, the research publications did not analyze deployment in a high-fidelity
simulation or in environments with complex dynamics and low-level actuators. As the common
format of such publications is not suitable for presenting such practically oriented work,
we have prepared a demonstrator featuring both algorithms in a suitable tactical scenario,
for which we used the Tactical Environment [5] previously developed within a series of
CERDEC-funded projects, Tactical AgentFly and Tactical AgentScout.

First, we have designed and implemented a common model of the tactical mission using
the STRIPS and PDDL languages. The mission domain specification outlined the model
mechanics and defined the types of units and their allowed behavior. The specification was
validated with a multiagent planner on an extended multiagent model of the problem
described in MA-STRIPS.

Based on the experience gained in this prototypical implementation, we have extended
both the domain description and the integration of the off-line multiagent planner.
Additionally, we have created a simulation of the tactical environment and integrated the
high-level STRIPS model with low-level control of units in the simulated tactical
environment. To validate this integration, we built a generator of planning problem
instances for the off-line planner from the initial simulation states, as depicted in
Figure 8. The planner then prepares a complete plan for the mission beforehand, and the
agents controlling the units synchronously execute the prescribed actions, ending in one
of the goal states. As the multiagent planner is proven to always return a valid plan and
the execution in the static scenario is always perfect, the agents always behave such that
the goal is reached, provided that a solution exists.

Figure 9: Selected initial states in the scenario with dynamic enemies; symmetric setting
(left), asymmetric setting (right).

The assumption of perfect action execution is valid only for a family of specialized
problems. In the general case, the execution can fail. Theoretically, we have studied three
types of planning under uncertainty. First, we have deepened our analysis from the previous
CERDEC-funded projects on the repair of multiagent plans in the form of prefix and suffix
reuse. Deployment of these techniques was published as part of [4]. Second, we have
proposed a tractable compilation scheme for fault-tolerant planning with a limit on the
number of possible execution failures. Finally, we have proposed new algorithms in the
family of Monte Carlo Tree Search and a multiagent variant of Monte Carlo sampling for
multiagent scenarios with actions executed in parallel.

The online multiagent planning algorithm was adapted to work with the STRIPS model
of the actions in the model of a tactical mission. By reusing and extending the principles
of integration of an off-line planner, we were able to extend the mission to a complete
conflict scenario with different types of units on both sides, causing uncertainty in the
execution of the planned actions. The uncertainty caused by an intelligent opponent is the
“worst” that can happen; therefore, the successful behavior of the agents against an
opponent validates the principle and shows that the winning or losing ratio depends on the
numbers of the units. The initial states of the conflict scenarios are depicted in Figure 9.
In each step of the simulation, the algorithm analyses the most prospective actions of the
agents, which are then executed by the respective tactical units. The demonstrator shows
that the behavior replicates the positive results of the algorithm on the synthetic
problems used in the theoretical analysis, and thus validates the applicability of online
planning under uncertainty in a high-fidelity simulation of a tactical mission, within the
margins of the assumptions and the model used.

Appendix A. Results Access & File Structure

Access  to  the  project  repository: 

url:  https://webdav.agents.fel.cvut.cz/data/projects/robust-planning/ 
username:  FA8655-12-1-2096
password:  robustmultiagentplanning 

File  structure  of  the  project  repository: 

reports  contains all project reports

papers  contains all research papers produced during the project

videos  contains videos captured from the demonstrator

tactical-env-demo  The  video  shows  2D  and  3D  visualization  of  the  tactical 
environment  used  for  the  demonstrator  together  with  all  used  unit  types: 
the  infantry,  the  guard  (AFV  Stryker)  and  the  convoy. 

static-enemy  Demonstration  of  the  small  and  large  scenarios  with  the  static 
enemy  (see  Figure  8). 

dynamic-enemy-asymmetric  An asymmetric conflict in the scenario with the
dynamic enemy. One side of the conflict has two more infantry units,
which allows it to win both example runs.

dynamic-enemy-symmetric  A symmetric variant of the previous scenario. In
this case, the result is determined much more by chance. The video shows three
runs resulting in a tie, a win of one side, and a win of the other side.

dynamic-enemy-guards  The full scenario as defined by the convoy domain,
with dynamic enemies in an uncertain environment, using the robust
planning system we have developed. The scenario includes the guard
AFVs. Explicit blocking of an enemy infantry by one of the guards is
emphasized. The video contains two runs, both ending in a tie.

demo  contains  the  demonstrator 

build  The  directory  contains  binaries  of  the  demonstrator  for  JDK7. 

src  The  directory  contains  Java  sources  of  the  demonstrator  with  dependencies. 

To run the demonstrator, Java JRE7 has to be installed, at least 4GB of RAM has to be
available, and an OpenGL 3.0 graphics card is required for 3D visualization.

To build the demonstrator from sources, Java JDK7 has to be installed. Use
Maven (https://maven.apache.org/). For all but robust-planning-demo run mvn
install; for robust-planning-demo run mvn package.

Appendix B. Selected Research Results

Abstract

We  consider  online  planning  in  Markov  decision  processes  (MDPs).  In  online  planning, 
the  agent  focuses  on  its  current  state  only,  deliberates  about  the  set  of  possible  policies  from 
that  state  onwards  and,  when  interrupted,  uses  the  outcome  of  that  exploratory  deliberation 
to  choose  what  action  to  perform  next.  The  performance  of  algorithms  for  online  planning 
is  assessed  in  terms  of  simple  regret,  which  is  the  agent’s  expected  performance  loss  when 
the  chosen  action,  rather  than  an  optimal  one,  is  followed. 

To date, state-of-the-art algorithms for online planning in general MDPs are either
best effort, or guarantee only polynomial-rate reduction of simple regret over time. Here
we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees
exponential-rate reduction of simple regret and error probability. This algorithm is based
on a simple yet non-standard state-space sampling scheme, MCTS2e, in which different parts
of each sample are dedicated to different exploratory objectives. Our empirical evaluation
shows that BRUE not only provides superior performance guarantees, but is also very
effective in practice and compares favorably to the state of the art. We then extend BRUE
with a variant of “learning by forgetting.” The resulting set of algorithms, BRUE(α),
generalizes BRUE, improves the exponential factor in the upper bound on its reduction rate,
and exhibits even more attractive empirical performance.


1.  Introduction 

Markov decision processes (MDPs) are a standard model for planning under uncertainty
(Puterman, 1994). An MDP ⟨S, A, Tr, R⟩ is defined by a set of possible agent states S, a
set of agent actions A, a stochastic transition function Tr : S × A × S → [0, 1], and a
reward function R : S × A × S → ℝ. Depending on the problem domain and the representation
language, the description of the MDP can be either declarative or generative (or mixed).
In any case, the description of the MDP is assumed to be concise. While declarative models
provide the agents with greater algorithmic flexibility, generative models are more
expressive, and both types of models allow for simulated execution of all feasible action
sequences, from any state of the MDP. The current state of the agent is fully observable,
and the objective of the agent is to act so as to maximize its accumulated reward. In the
finite horizon setting that will be used for most of the paper, the reward is accumulated
over some predefined number of steps H.

The desire to handle MDPs with state spaces of size exponential in the size of the model
description has led researchers to consider online planning in MDPs. In online planning,
the agent, rather than computing a quality policy for the entire MDP before taking any
action, focuses only on what action to perform next. The decision process consists of a
deliberation phase, aka planning, terminated either according to a predefined schedule or
due to an external interrupt, and followed by a recommended action for the current state.
Once that action is applied in the real environment, the decision process is repeated from
the obtained state to select the next action and so on.

The quality of the action a, recommended for state s with H steps-to-go, is assessed in
terms of the probability that a is sub-optimal, and in terms of the (closely related)
measure of simple regret Δ_H[s, a]. The latter captures the performance loss that results
from taking a and then following an optimal policy π* for the remaining H − 1 steps,
instead of following π* from the beginning (Bubeck & Munos, 2010). That is,

\[ \Delta_H[s, a] = Q_H(s, \pi^*(s, H)) - Q_H(s, a), \]

where

\[ Q_H(s, a) = \mathbb{E}_{s'} \left[ R(s, a, s') + Q_{H-1}(s', \pi^*(s', H-1)) \right]. \]
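To make the two definitions concrete, the following sketch evaluates them by exact
backward induction on a tiny, made-up MDP. The MDP, its state and action names, and the
helper functions are hypothetical; the sketch additionally assumes that the value with zero
steps-to-go, or in a state without applicable actions, is 0.

```python
from functools import lru_cache

# Hypothetical toy MDP: TRANSITIONS[s][a] = list of (probability, next_state, reward).
TRANSITIONS = {
    "s0": {
        "safe": [(1.0, "s1", 1.0)],
        "risky": [(0.5, "s1", 3.0), (0.5, "s2", 0.0)],
    },
    "s1": {"stay": [(1.0, "s1", 1.0)]},
    "s2": {},  # sink state: no applicable actions
}


def q_value(s, a, h):
    """Q_h(s, a): expected reward of a followed by an optimal policy for h - 1 steps."""
    return sum(p * (r + optimal_value(s2, h - 1)) for p, s2, r in TRANSITIONS[s][a])


@lru_cache(maxsize=None)
def optimal_value(s, h):
    """Q_h(s, pi*(s, h)), i.e. the value of acting optimally from s with h steps-to-go."""
    if h == 0 or not TRANSITIONS[s]:
        return 0.0
    return max(q_value(s, a, h) for a in TRANSITIONS[s])


def simple_regret(s, a, h):
    """Delta_h[s, a] = Q_h(s, pi*(s, h)) - Q_h(s, a)."""
    return optimal_value(s, h) - q_value(s, a, h)
```

In this toy model the better root action depends on the horizon: the regret of one action
is zero for a short horizon and positive for a longer one, which is why the definition
carries the steps-to-go index.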

With a few recent exceptions developed for declarative MDPs (Bonet & Geffner, 2012;
Kolobov, Mausam, & Weld, 2012; Busoniu & Munos, 2012), most algorithms for online
MDP planning constitute variants of what is called Monte-Carlo tree search (MCTS). One
of the earliest and best-known MCTS algorithms for MDPs is the sparse sampling algorithm
by Kearns, Mansour, and Ng (1999). Sparse sampling offers a near-optimal action selection
in discounted MDPs by constructing a sampled lookahead tree in time exponential in the
discount factor and the suboptimality bound, but independent of the state space size.
However, if terminated before an action has proved to be near-optimal, sparse sampling
offers no quality guarantees on its action selection. Thus it does not really fit the
setup of online planning. Several later works introduced interruptible, anytime MCTS
algorithms for MDPs, with UCT (Kocsis & Szepesvari, 2006) probably being the most
widely used such algorithm these days. Anytime MCTS algorithms are designed to provide
convergence to the best action if enough time is given for deliberation, as well as a
gradual reduction of performance loss over the deliberation time (Sutton & Barto, 1998;
Peret & Garcia, 2004; Kocsis & Szepesvari, 2006; Coquelin & Munos, 2007; Cazenave, 2009;
Rosin, 2011; Tolpin & Shimony, 2012). While UCT and its successors have been devised
specifically for MDPs, some of these algorithms are also successfully used in partially
observable and adversarial settings (Gelly & Silver, 2011; Sturtevant, 2008; Bjarnason,
Fern, & Tadepalli, 2009; Balla & Fern, 2009; Eyerich, Keller, & Helmert, 2010).

In  general,  the  relative  empirical  attractiveness  of  the  various  MCTS  planning  algorithms 
depends  on  the  specifics  of  the  problem  at  hand  and  cannot  usually  be  predicted  ahead  of 
time.  When  it  comes  to  formal  guarantees  on  the  expected  performance  improvement  over 
the  planning  time,  very  few  of  these  algorithms  provide  such  guarantees  for  general  MDPs, 
and  none  breaks  the  barrier  of  the  worst-case  only  polynomial-rate  reduction  of  simple  regret 
and  choice-error  probability  over  time. 

This is precisely our contribution here. We introduce a new Monte-Carlo tree search
algorithm, BRUE, that guarantees exponential-rate reduction of both simple regret and
choice-error probability over time, for general MDPs over finite state spaces. The algorithm
is based on a simple and efficiently implementable sampling scheme, MCTS2e, in which
different parts of each sample are dedicated to different competing exploratory objectives.
The motivation for this objective decoupling came from a recently growing understanding
that the current MCTS algorithms for MDPs do not optimize the reduction of simple regret
directly, but only via optimizing what is called cumulative regret, a performance measure
suitable for the (very different) setting of reinforcement learning (Bubeck & Munos, 2010;
Busoniu & Munos, 2012; Tolpin & Shimony, 2012; Feldman & Domshlak, 2012). Our
empirical evaluation on some standard MDP benchmarks for comparison between MCTS
planning algorithms shows that BRUE not only provides superior performance guarantees,
but is also very effective in practice and compares favorably to the state of the art. We
then extend BRUE with a variant of “learning by forgetting.” The resulting family of
algorithms, BRUE(α), generalizes BRUE, improves the exponential factor in the upper bound
on its reduction rate, and exhibits even more attractive empirical performance.

MCTS: [input: ⟨S, A, Tr, R⟩; s_0 ∈ S]
    search tree T ← root node s_0
    while time permits:
        ρ ← sample(s_0, T)
        T ← expand-tree(T, ρ)
        update-statistics(T, ρ)
    return recommend-action(s_0, T)

Figure 1: High-level scheme for regular Monte-Carlo tree sampling.

2.  Monte-Carlo  Planning 

MCTS, a high-level scheme for Monte-Carlo tree search that gives rise to various specific
algorithms for online MDP planning, is depicted in Figure 1. Starting with the current state
s_0, MCTS performs an iterative construction of a tree T rooted at s_0. At each iteration,
MCTS issues a state-space sample from s_0, expands the tree T using the outcome of that
sample, and updates information stored at the nodes of T. Once the simulation phase is
over, MCTS uses the information collected at the nodes of T to recommend an action to
perform in s_0. For compatibility of the notation with prior literature, in what follows we
refer to the tree nodes via the states associated with these nodes. Note that, due to the
Markovian nature of MDPs, it is unreasonable to distinguish between nodes associated with
the same state at the same depth. Hence, the actual graph constructed by most instances
of MCTS forms a DAG over nodes (s, h) ∈ S × {0, 1, ..., H}. By A(s) ⊆ A in what follows,
we refer to the subset of actions applicable in state s.

Numerous concrete instances of MCTS have been proposed, with UCT (Kocsis & Szepesvari,
2006) probably being the most popular such algorithm these days (Gelly & Silver, 2011;
Sturtevant, 2008; Bjarnason et al., 2009; Balla & Fern, 2009; Eyerich et al., 2010; Keller
& Eyerich, 2012a). To give a concrete sense of MCTS’s components, as well as to ground
some intuitions discussed later on, below we describe the specific setting of MCTS
corresponding to the core UCT algorithm, and Figure 2 illustrates the UCT tree construction,
with n denoting the number of state-space samples.

•  sample: The samples ρ = ⟨s_0, a_1, s_1, ..., a_k, s_k⟩ are all issued from the root node
s_0. The sample ends either when a sink state is reached, that is, A(s_k) = ∅, or when
k = H. Each node/action pair (s, a) is associated with a counter n(s, a) and a value
accumulator Q(s, a). Both n(s, a) and Q(s, a) are initialized to 0, and then updated
by the update-statistics procedure. Given s_i, the next-on-the-sample action a_{i+1} is
selected according to the deterministic UCB1 policy (Auer, Cesa-Bianchi, & Fischer,
2002a), originally proposed for optimal cumulative regret minimization in stochastic
multi-armed bandit (MAB) problems (Robbins, 1952): If n(s_i, a) > 0 for all a ∈ A(s_i),
then

\[ a_{i+1} = \arg\max_{a} \left[ Q(s_i, a) + c \sqrt{\frac{\log n(s_i)}{n(s_i, a)}} \right], \tag{1} \]

where n(s) = Σ_a n(s, a). Otherwise, a_{i+1} is selected uniformly at random from the
still unexplored actions {a ∈ A(s_i) | n(s_i, a) = 0}. In both cases, s_{i+1} is then
sampled according to the conditional probability P(S | s_i, a_{i+1}), induced by the
transition function Tr.

•  expand-tree: Each state-space sample ρ = ⟨s_0, a_1, s_1, ..., a_k, s_k⟩ induces a state
trace ⟨s_0, s_1, ..., s_i⟩ inside T, as well as a state trace ⟨s_{i+1}, ..., s_k⟩ outside
of T. In principle, T can be expanded with any prefix of ⟨s_{i+1}, ..., s_k⟩; a popular
choice in prior work appears to be expanding T with only the upper-most node s_{i+1}. (If T
is constructed as a DAG, it is expanded with the first node along ρ that leaves T.)

•  update-statistics: For each node s_i along ρ that is now part of the expanded tree T,
the counter n(s_i, a_{i+1}) is incremented and the estimated Q-value is updated as

\[ Q(s_i, a_{i+1}) \leftarrow Q(s_i, a_{i+1}) + \frac{\bar{R}_i - Q(s_i, a_{i+1})}{n(s_i, a_{i+1})}, \tag{2} \]

where \bar{R}_i = \sum_{j=i}^{k-1} R(s_j, a_{j+1}, s_{j+1}).

•  recommend-action: Interestingly, the action recommendation protocol of UCT was
never properly specified, and different applications of UCT adopt different decision
rules, including maximization of the estimated Q-value, of the augmented estimated
Q-value as in Eq. 1, of the number of times the action was selected during the
simulation, as well as randomized protocols based on the information collected at the
root.
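The four components can be put together in a compact sketch. The code below is an
illustrative reading of the description above and not a reference implementation: the
generative-model interface (`actions(s)` and `simulate(s, a, rng)` returning a next state
and a reward), the default exploration constant, and the choice to recommend the root
action with the highest estimated Q-value (one of the rules mentioned above) are
assumptions.

```python
import math
import random
from collections import defaultdict


def uct(s0, actions, simulate, horizon, n_samples, c=1.0, rng=None):
    """Run n_samples UCT iterations from s0 and recommend a root action."""
    rng = rng or random.Random(0)
    n = defaultdict(int)    # n[(node, a)]: visit counter
    q = defaultdict(float)  # q[(node, a)]: running mean of sampled returns
    tree = {(s0, 0)}        # nodes are (state, depth) pairs

    def select(node, acts):
        untried = [a for a in acts if n[(node, a)] == 0]
        if untried:  # also covers nodes outside the tree, whose counters stay at 0
            return rng.choice(untried)
        total = sum(n[(node, a)] for a in acts)
        return max(acts, key=lambda a: q[(node, a)]
                   + c * math.sqrt(math.log(total) / n[(node, a)]))  # Eq. 1

    for _ in range(n_samples):
        # sample: follow the selection rule until a sink state or depth H
        s, trace = s0, []
        for depth in range(horizon):
            acts = actions(s)
            if not acts:
                break
            node = (s, depth)
            a = select(node, acts)
            s, r = simulate(s, a, rng)
            trace.append((node, a, r))
        # expand-tree: add only the first visited node that is not yet in the tree
        for node, _a, _r in trace:
            if node not in tree:
                tree.add(node)
                break
        # update-statistics: Eq. 2 with the return accumulated from each node onwards
        ret = 0.0
        for node, a, r in reversed(trace):
            ret += r
            if node in tree:
                n[(node, a)] += 1
                q[(node, a)] += (ret - q[(node, a)]) / n[(node, a)]

    # recommend-action: highest estimated Q-value at the root (one possible rule)
    return max(actions(s0), key=lambda a: q[((s0, 0), a)])
```

Because statistics are updated only for nodes inside the tree, every action outside the
tree stays "untried", so the part of a sample below the tree is a uniformly random rollout.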


The key property of UCT is that its exploration of the search space is obtained by
considering a hierarchy of forecasters, each minimizing its own cumulative regret, that is,
the loss of the total reward incurred by exploring the environment (Auer et al., 2002a).
Each such pseudo-agent forecaster corresponds to a state/steps-to-go pair (s, h). In that
respect, according to Theorem 6 of Kocsis and Szepesvari (2006), UCT asymptotically
achieves the best possible (logarithmic) cumulative regret. However, as recently pointed out
in numerous works (Bubeck & Munos, 2010; Busoniu & Munos, 2012; Tolpin & Shimony,
2012; Feldman & Domshlak, 2012), cumulative regret does not seem to be the right objective
for online MDP planning, and this is because the rewards “collected” at the simulation
phase are fictitious. Furthermore, the work of Bubeck, Munos, and Stoltz (2011) on
multi-armed bandits shows that minimizing cumulative regret and minimizing simple regret
are somewhat competing objectives. Indeed, the same Theorem 6 of Kocsis and Szepesvari
(2006) claims only a polynomial-rate reduction of the probability of choosing a non-optimal
action, and the results of Bubeck et al. (2011) on simple regret minimization in MABs with
stochastic rewards imply that UCT achieves only polynomial-rate reduction of the simple
regret over time. Some attempts have recently been made to adapt UCT, and MCTS-based
planning in general, to optimizing simple regret in online MDP planning directly, and some
of these attempts were empirically rather successful (Tolpin & Shimony, 2012; Hay, Shimony,
Tolpin, & Russell, 2012). However, to the best of our knowledge, none of them breaks UCT’s
barrier of the worst-case polynomial-rate reduction of simple regret over time.

Figure 2: Illustration of the UCT dynamics


3.  Simple  Regret  Minimization  in  MDPs 

We now show that exponential-rate reduction of simple regret in online MDP planning is
achievable. To do so, we first motivate and introduce a family of MCTS algorithms with a
two-phase scheme for generating state space samples, and then describe a concrete algorithm
from this family, BRUE, that (1) guarantees that the probability of recommending a
non-optimal action asymptotically converges to zero at an exponential rate, and (2) achieves
exponential-rate reduction of simple regret over time.

3.1  Exploratory  concerns  in  online  MDP  planning 

The work of Bubeck et al. (2011) on pure exploration in multi-armed bandit (MAB) problems
was probably the first to stress that the minimal simple regret can increase as the
bound on the cumulative regret decreases. At a high level, Bubeck et al. (2011) show
that efficient schemes for simple regret minimization in MAB should be as exploratory as
possible, thus improving the expected quality of the recommendation issued at the end of
the learning process. In particular, they showed that the simple round-robin sampling of
MAB actions, followed by recommending the action with the highest empirical mean, yields
exponential-rate reduction of simple regret, while the UCB1 strategy that balances between
exploration and exploitation yields only polynomial-rate reduction of that measure. In that
respect, the situation with MDPs is seemingly no different, and thus Monte-Carlo MDP
planning should focus on exploration only. However, the answer to the question of what it
means to be “as exploratory as possible” with MDPs is less straightforward than it is in
the special case of MABs.
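The round-robin scheme mentioned above is simple enough to state in a few lines. The
sketch below is illustrative only; representing each arm as a callable that takes a random
generator and returns a reward is an assumption of this example, as is the function name.

```python
import random


def round_robin_recommend(arms, budget, rng=None):
    """Pull the arms in cyclic order and recommend the one with the best empirical mean.

    arms: list of callables, each taking a random.Random and returning a reward.
    budget: total number of pulls available before a recommendation is required.
    Returns the index of the recommended arm.
    """
    rng = rng or random.Random(0)
    counts = [0] * len(arms)
    sums = [0.0] * len(arms)
    for t in range(budget):
        i = t % len(arms)            # pure exploration: every arm gets equal attention
        sums[i] += arms[i](rng)
        counts[i] += 1
    means = [s / k if k else float("-inf") for s, k in zip(sums, counts)]
    return max(range(len(arms)), key=means.__getitem__)
```

A stochastic arm can be supplied as, for example, `lambda rng: float(rng.random() < 0.7)`
for a Bernoulli reward with success probability 0.7. Unlike UCB1, the sampling order here
never depends on the observed rewards.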

For an intuition as to why the “pure exploration dilemma” in MDPs is somewhat complicated,
consider the state/steps-to-go pairs (s, h) as pseudo-agents, all acting on behalf of
the root pseudo-agent (s_0, H) that aims at minimizing its own simple regret in a stochastic
MAB induced by the applicable actions A(s_0). Clearly, if an oracle would provide (s_0, H)
with an optimal action π*(s_0, H), then no further deliberation would be needed until after
the execution of π*(s_0, H). However, the task characteristics of (s_0, H) are an exception
rather than a rule. Suppose that an oracle provides us with optimal actions for all
pseudo-agents (s, h) but (s_0, H). Despite the richness of this information, (s_0, H) in
some sense remains as clueless as it was before: To choose between the actions in A(s_0),
(s_0, H) needs, at the very least, some ordinal information about the expected value of
these alternatives. Hence, when sampling the futures, each non-root pseudo-agent (s, h)
should be devoted to two objectives:

(1)  identifying  an  optimal  action  7r*(s,  h),  and 

(2)  estimating  the  actual  value  of  that  action,  because  this  information  is  needed  by  the 

predecessor(s)  of  (s,  h )  in  T ■ 

Note that both these objectives are exploratory, yet the problem is that they are somewhat competing. In that respect, the choices made by UCT actually make sense: each sample $\rho$ issued by UCT at $(s, h)$ is a priori devoted both to increasing the confidence that some current candidate $a'$ for $\pi^*(s, h)$ is indeed $\pi^*(s, h)$, and to improving the estimate of $Q_h(s, a')$, while acting as if $\pi^*(s, h) = a'$. However, while such an overloading of the samples is unavoidable in the "learning while acting" setup of reinforcement learning, this should not necessarily be the case in online planning. Moreover, this sample overloading in UCT comes at a high price: as shown by Coquelin and Munos (2007), the number of samples after which the bounds of UCT on both simple and cumulative regret become meaningful might be as high as hyper-exponential in $H$.

3.2 Separation of Concerns at the Extreme

Separating the two aforementioned exploratory concerns is the focus of our investigation here. Let $s_0$ be a state of an MDP $\langle S, A, Tr, R \rangle$ with rewards in $[0, 1]$, $K$ applicable actions at each state, $B$ possible outcome states for each action, and finite horizon $H$. First, to get a sense of what separation of exploratory concerns in online planning can buy us, we begin with a MAB perspective on MDPs, with each arm in the MAB corresponding to a "flat" policy of acting for $H$ steps starting from the current state $s_0$. A "flat" policy $\pi$ is a minimal partial mapping from state/steps-to-go pairs to actions that fully specifies an acting strategy in the MDP for $H$ steps, starting at $s_0$. Sampling such an arm $\pi$ is straightforward, as $\pi$ prescribes precisely which action should be applied at every state that can possibly be encountered along the execution of $\pi$. The reward of such an arm $\pi$ is stochastic, with support $[0, H]$, and the number of arms in this schematic MAB is $K^{\sum_{i=0}^{H-1} B^i} \approx K^{B^H}$.

Now, consider a simple algorithm, NaiveUniform, which systematically samples each "flat" policy in a loop, and updates the estimation of the corresponding arm with the obtained reward. If stopped at iteration $n$, the algorithm recommends $\pi(s_0)$, where $\pi$ is the arm/policy with the best empirical value $\hat{\mu}_{\pi,n}$. By iteration $n$ of this algorithm, each arm will have been sampled at least $\lfloor n / K^{B^H} \rfloor$ times. Therefore, using Hoeffding's inequality, the probability that the chosen arm $\pi$ is sub-optimal in our MAB is bounded by

$$P\{\hat{\mu}_{\pi,n} \ge \hat{\mu}_{\pi^*,n}\} = P\{\hat{\mu}_{\pi,n} - \hat{\mu}_{\pi^*,n} - (-\Delta_\pi) \ge \Delta_\pi\} \le \exp\left(-\frac{\lfloor n / K^{B^H} \rfloor \Delta_\pi^2}{2H^2}\right), \tag{3}$$


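where $\Delta_\pi = \mu_{\pi^*} - \mu_\pi$, and thus, with $d$ denoting the smallest such gap over the sub-optimal arms, the expected simple regret can be bounded as

$$E r_n \le H K^{B^H} \exp\left(-\frac{\lfloor n / K^{B^H} \rfloor d^2}{2H^2}\right). \tag{4}$$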


Note that NaiveUniform uses each sample $\rho = \langle s_0, a_0, s_1, a_1, \dots, a_{H-1}, s_H \rangle$ to update the estimation of only a single policy $\pi$. However, recalling that arms in our MAB problem are actually compound policies, the same sample can in principle be used to update the estimates of all policies $\pi'$ that are consistent with $\rho$ in the sense that, for $0 \le i \le H - 1$, $\pi'(s_i, H - i)$ is defined and $\pi'(s_i, H - i) = a_i$. The resulting algorithm, CraftyUniform, generates samples by choosing the actions along them uniformly at random, and uses the outcome of each sample to update all the policies consistent with it. Note that sampling the arms in CraftyUniform cannot be done systematically as in NaiveUniform because the set of policies updated at each iteration is stochastic.

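Since the sampling is uniform, the probability that any given policy is updated by the sample issued at any iteration of CraftyUniform is $1/K^H$. For an arm $\pi$, let $N_{\pi,n}$ denote the number of samples issued at the $n$ iterations of CraftyUniform that are consistent with the policy $\pi$. The probability that $\pi$, the best empirical arm after $n$ iterations, is sub-optimal is bounded by

$$P\{\hat{\mu}_{\pi,n} \ge \hat{\mu}_{\pi^*,n}\} \le P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2}\right\} + P\left\{\hat{\mu}_{\pi^*,n} - \mu_{\pi^*} \le -\frac{\Delta_\pi}{2}\right\}. \tag{5}$$

Each of the two terms on the right-hand side can be bounded as:

$$\begin{aligned}
P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2}\right\}
&\le P\left\{N_{\pi,n} \le \frac{n}{2K^H}\right\} + P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2},\ N_{\pi,n} > \frac{n}{2K^H}\right\} \\
&\overset{(\dagger)}{\le} e^{-\frac{n}{2K^{2H}}} + \sum_{i=\frac{n}{2K^H}+1}^{n} P\{N_{\pi,n} = i\}\, P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2} \,\middle|\, N_{\pi,n} = i\right\} \\
&\le e^{-\frac{n}{2K^{2H}}} + P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2} \,\middle|\, N_{\pi,n} = \frac{n}{2K^H} + 1\right\} \\
&\overset{(\ddagger)}{\le} e^{-\frac{n}{2K^{2H}}} + e^{-\frac{n \Delta_\pi^2}{4K^H H^2}} \\
&\le 2 e^{-\frac{n \Delta_\pi^2}{4K^{2H} H^2}},
\end{aligned} \tag{6}$$

where $(\dagger)$ and $(\ddagger)$ are by the Hoeffding inequality. In turn, similarly to Eq. 4, the simple regret for CraftyUniform is bounded, in terms of the smallest gap $d$ among the $\Delta_\pi$, by

$$E r_n \le 4 H K^{B^H} e^{-\frac{n d^2}{4K^{2H} H^2}}. \tag{7}$$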


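Since $H$ is a trivial upper bound on $E r_n$, the bound in Eq. 7 becomes effective only when $4K^{B^H} \exp\left(-\frac{n d^2}{4K^{2H} H^2}\right) < 1$, that is, for

$$n > \frac{4K^{2H} H^2}{d^2} \ln\left(4K^{B^H}\right). \tag{8}$$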

Note  that  this  transition  period  length  is  still  much  better  than  that  of  UCT,  which  is 
hyper-exponential  in  H .  Moreover,  unlike  in  UCT,  the  rate  of  the  simple  regret  reduction 
is  then  exponential  in  the  number  of  iterations. 
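
To get a feel for the length of this transition period, the threshold can be evaluated numerically. The sketch below assumes that the CraftyUniform bound has the form $4 H K^{B^H} \exp(-n d^2 / (4 K^{2H} H^2))$, and works with logarithms so that the very large count $K^{B^H}$ is never materialized. The function name and the sample parameter values are illustrative only.

```python
import math


def crafty_uniform_transition(K, B, H, d):
    """Threshold n at which 4 * K**(B**H) * exp(-n d^2 / (4 K^(2H) H^2))
    drops to 1, i.e. where the assumed regret bound starts to beat the
    trivial bound H."""
    log_arms = (B ** H) * math.log(K)  # ln K^(B^H), computed in log-space
    return 4 * K ** (2 * H) * H ** 2 * (math.log(4) + log_arms) / d ** 2


# Illustrative values: 2 actions, 2 outcomes per action, horizon 2, gap 0.5.
threshold = crafty_uniform_transition(K=2, B=2, H=2, d=0.5)
```

Even for such a tiny problem the threshold is in the thousands of samples, and it grows exponentially in $H$ through both $K^{2H}$ and $B^H$.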


3.3  Two-phase  sampling  and  BRUE 

While both the simple regret convergence rate and the length of the transition period of CraftyUniform are more attractive than those of UCT, this in itself is not much of a help: CraftyUniform requires explicit reasoning about $K^{B^H}$ arms, and thus it cannot be efficiently implemented. However, it does show the promise of separation of concerns in online planning. We now introduce an MCTS family of algorithms, referred to as MCTS2e, that allows utilizing this promise to a large extent.

The instances of the MCTS2e family vary along four parameters: switching point function $\sigma : \mathbb{N} \to \{1, \dots, H\}$, exploration policy, estimation policy, and update policy. With respect to these four parameters, the MCTS components in MCTS2e are as follows.

• Similarly to UCT, each node/action pair $(s, a)$ is associated with variables $n(s, a)$ and $\hat{Q}(s, a)$. However, while counters $n(s, a)$ are initialized to 0, value accumulators $\hat{Q}(s, a)$ are schematically initialized to $-\infty$.

• sample: Each iteration of BRUE corresponds to a single state-space sample of the MDP, and these samples $\rho = \langle s_0, a_1, s_1, \dots, a_k, s_k \rangle$ are all issued from the root node $s_0$. The sample ends either when a sink state is reached, that is, $A(s_k) = \emptyset$, or when $k = H$. The generation of $\rho$ is done in two phases: at iteration $n$, the actions at states $s_0, \dots, s_{\sigma(n)-1}$ are selected according to the exploration policy of the algorithm, while the actions at states $s_{\sigma(n)}, \dots, s_{k-1}$ are selected according to its estimation policy.

• expand-tree: $T$ is expanded with the suffix of state sequence $s_1, \dots, s_{\sigma(n)-1}$ that is new to $T$.

• update-statistics: For each state $s_i \in \{s_0, \dots, s_{\sigma(n)-1}\}$, the update policy of the algorithm prescribes whether it should be updated. If $s_i$ should be updated, then the counter $n(s_i, a_{i+1})$ is incremented and the estimated Q-value is updated according to Eq. 2.

• recommend-action: The recommended action is chosen uniformly at random among the actions $a$ maximizing $\hat{Q}(s_0, a)$.

In what follows, for $n > 0$, the $n$-th iteration of BRUE will be called an $h$-iteration if $\sigma(n) = h$. At a high level, the two phases of sample generation respectively target the two exploratory objectives of online MDP planning: while the sample prefixes aim at exploring the options, the sample suffixes aim at improving the value estimates for the current candidates for $\pi^*$. In particular, this separation allows us to introduce a specific MCTS2e instance, BRUE (see footnote 1), that is tailored to simple regret minimization. The BRUE setting of MCTS2e is described below, and Figure 3 illustrates its dynamics.

• The switching point function $\sigma : \mathbb{N} \to \{1, \dots, H\}$ is

$$\sigma(n) = H - ((n - 1) \bmod H), \tag{9}$$

that is, the depth of exploration is chosen by a round-robin on $\{1, \dots, H\}$, in reverse order.

• At state $s$, the exploration policy samples an action uniformly at random, while the estimation policy samples an action uniformly at random, but only among the actions $a \in A(s)$ that maximize $\hat{Q}(s, a)$.

• For a sample $\rho$ issued at iteration $n$, only the state/action pair $(s_{\sigma(n)-1}, a_{\sigma(n)})$ immediately preceding the switching state $s_{\sigma(n)}$ along $\rho$ is updated. That is, the information obtained by the second phase of $\rho$ is used only for improving the estimate at state $s_{\sigma(n)-1}$, and is not pushed further up the sample. While that may appear wasteful and even counterintuitive, this locality of update is required to satisfy the formal guarantees of BRUE discussed below.
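
The BRUE setting above can be sketched as follows. This is an illustrative reading rather than the authors' implementation. It assumes a generative model given by two hypothetical callables, `actions(state)` and `step(state, action, rng)`, hashable states, and a running-average update standing in for Eq. 2; estimates are keyed by state, depth, and action.

```python
import math
import random
from collections import defaultdict


class Brue:
    def __init__(self, actions, step, horizon, rng=None):
        self.actions = actions  # actions(state) -> list of applicable actions
        self.step = step        # step(state, action, rng) -> (next_state, reward)
        self.H = horizon
        self.rng = rng or random.Random()
        self.n = defaultdict(int)  # visit counters per (state, depth, action)
        self.q = {}                # estimates; a missing key stands for -infinity

    def switch_point(self, it):
        # Round-robin over {1, ..., H} in reverse order; iterations start at 1.
        return self.H - ((it - 1) % self.H)

    def _estimation_action(self, s, depth):
        # Uniform choice among the actions with the best current estimate.
        acts = self.actions(s)
        best = max(self.q.get((s, depth, a), -math.inf) for a in acts)
        return self.rng.choice(
            [a for a in acts if self.q.get((s, depth, a), -math.inf) == best])

    def iterate(self, s0, it):
        sigma = self.switch_point(it)
        s, depth, pivot, tail = s0, 0, None, 0.0
        while depth < self.H and self.actions(s):
            if depth < sigma:
                a = self.rng.choice(self.actions(s))   # exploration phase
            else:
                a = self._estimation_action(s, depth)  # estimation phase
            s_next, r = self.step(s, a, self.rng)
            if depth == sigma - 1:
                pivot = (s, depth, a)  # pair just before the switching state
            if depth >= sigma - 1:
                tail += r              # reward accumulated from the pivot on
            s, depth = s_next, depth + 1
        if pivot is not None:          # only the pivot pair is updated
            self.n[pivot] += 1
            old = self.q.get(pivot, 0.0)
            self.q[pivot] = old + (tail - old) / self.n[pivot]

    def recommend(self, s0):
        return self._estimation_action(s0, 0)
```

A caller would invoke `iterate(s0, it)` for `it = 1, 2, ...` until the deliberation budget is exhausted and then call `recommend(s0)`. If a sample hits a sink state before reaching the switching depth, this sketch performs no update, which is a simplifying assumption.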

Before we proceed with the formal analysis of BRUE, a few comments on it, as well as on the MCTS2e sampling scheme in general, are in order. First, the template of MCTS2e is rather general, and some of its parametrizations will not even guarantee convergence to the optimal action. This, for instance, will be the case with a (seemingly minor) modification of BRUE to a purely uniform estimation policy. In short, MCTS2e should be parametrized with care. Second, while in what follows we focus on BRUE, other instances of MCTS2e may prove to be empirically effective as well with respect to the reduction of simple regret over time. Some of them, similarly to BRUE, may also guarantee exponential-rate reduction of simple regret over time. Hence, we clearly cannot, and do not, claim any uniqueness of BRUE in that respect. Finally, some other families of MCTS algorithms, more sophisticated than MCTS2e, can give rise to even more (formally and/or empirically) efficient optimizers of simple regret. The BRUE($\alpha$) set of algorithms that we discuss later on is one such example.

1. Short for Best Recommendation with Uniform Exploration; the name is carried over from our first presentation of the algorithm in (Feldman & Domshlak, 2012), where "estimation" was referred to as "recommendation."

4. Upper Bounds on Simple Regret Reduction Rate with BRUE

For the sake of simplicity, in our formal analysis of BRUE we assume uniqueness of the optimal policy $\pi^*$; that is, at each state $s$ and each number $h$ of steps-to-go, there is a single optimal action, and it is $\pi^*(s, h)$. Let $T_n$ be the graph obtained by BRUE after $n$ iterations, and let $\hat{Q}_h(s, a)$ denote the accumulated value $\hat{Q}(s, a)$ for $s$ at depth $H - h$. For all state/steps-to-go pairs $(s, h) \in T_n$, $\pi^B_n(s, h)$ is a randomized strategy, uniformly choosing among actions $a$ maximizing $\hat{Q}_h(s, a)$. We also use some additional auxiliary notation.

$K = \max_{s \in S} |A(s)|$, i.e., the maximal number of actions per state.

$p = \min_{s, a, s' : Tr(s, a, s') > 0} Tr(s, a, s')$, i.e., the likelihood of the least likely (but still possible) outcome of an action in our problem.

$d = \min_{s, a} \Delta_1[s, a]$, i.e., the smallest difference between the value of the optimal and a second-best action at a state with just one step-to-go.

Our  key  result  on  the  BRUE  algorithm  is  Theorem  1  below.  The  proof  of  Theorem  1,  as 
well  as  of  several  required  auxiliary  claims,  is  given  in  Appendix  A.  Here  we  outline  only 
the  key  issues  addressed  by  the  proof,  and  provide  a  high-level  flow  of  the  proof  in  terms  of 
a  few  central  auxiliary  claims. 


Theorem 1 Let BRUE be called on a state $s_0$ of an MDP $\langle S, A, Tr, R \rangle$ with rewards in $[0, 1]$ and finite horizon $H$. There exist pairs of parameters $c, c' > 0$, dependent only on $\{p, d, K, H\}$, such that, after $n > H$ iterations of BRUE, we have simple regret bounded as

$$E \Delta_H[s_0, \pi^B_n(s_0, H)] \le H c \cdot e^{-c' n}, \tag{10}$$

and choice-error probability bounded as

$$P\{\pi^B_n(s_0, H) \ne \pi^*(s_0, H)\} \le c \cdot e^{-c' n}. \tag{11}$$

In particular, these bounds hold for

$$c = \frac{4 K^{3H^2 - 2H} (H!)^3 \prod_{i=1}^{H-1} (i!)^4 \, 24^{H-1} \, 16^{(H-1)^2}}{d^{2H^2 - 4H + 2} \, p^{3H^2 - 3H}} \tag{12}$$

and

$$c' = \frac{3 d^{2(H-1)} p^{2H-1}}{2H \cdot 16^{H-1} (H!)^2 K^{2H}}. \tag{13}$$

Before we proceed any further, some discussion of the statements in Theorem 1 is in order. First, the parameters $c$ and $c'$ in the bounds established by Theorem 1 are problem-dependent: in addition to the dependence on the horizon $H$ and the choice branching factor $K$ (which is unavoidable), the parameters $c$ and $c'$ also depend on the distribution parameters $p$ and $d$. While it is possible that this dependence can be partly alleviated, Bubeck et al. (2011) showed that distribution-free exponential bounds on the simple regret reduction rate cannot be achieved even in MABs, that is, even in single-step-to-go MDPs (see Remark 2 of Bubeck et al. (2011), which is based on a lower bound on the cumulative regret established by Auer, Cesa-Bianchi, Freund, & Schapire, 2002b). Second, the specific parameters $c$ and $c'$ provided by Eqs. 12 and 13 are worst-case for MDPs with parameters $d$, $p$, and $K$, and the bound in Eq. 10 becomes effective after

$$n > \frac{\ln(c)}{c'} = O\left(\epsilon^{H^2}\right)$$

iterations, for some small constant $\epsilon > 1$. While there is still some gap between this transition period length and that of the theoretical CraftyUniform algorithm (see Eq. 8), this gap is not that large (see footnote 2).

The  proof  of  Lemma  2  below  constitutes  the  crux  of  the  proof  of  Theorem  1.  Once 
we  have  proven  this  lemma,  the  proof  of  Theorem  1  stems  from  it  in  a  more-or-less  direct 
manner. 


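Lemma 2 Let BRUE be called on a state $s_0$ of an MDP $\langle S, A, Tr, R \rangle$ with rewards in $[0, 1]$ and finite horizon $H$. For each $h \in [\![H]\!]$, there exist parameters $c_h, c'_h > 0$, dependent only on $\{p, d, K, H\}$, such that, for each state $s$ reachable from $s_0$ in $H - h$ steps and any $t > 0$, it holds that

$$\begin{aligned}
P\left\{\hat{Q}_h(s, a) - Q_h(s, a) \ge \frac{d}{2} \,\middle|\, n_h(s, a) = t\right\} &\le c_h e^{-c'_h t}, \\
P\left\{\hat{Q}_h(s, a) - Q_h(s, a) \le -\frac{d}{2} \,\middle|\, n_h(s, a) = t\right\} &\le c_h e^{-c'_h t}.
\end{aligned} \tag{14}$$

In particular, these bounds hold for

$$c_h = \frac{K^{2Hh + h^2 - 2H - 1} (h!)^3 \prod_{i=1}^{h-1} (i!)^4 \, 24^{h-1} \, 16^{(h-1)^2}}{d^{2(h-1)^2} \, p^{2Hh + h^2 - 2H - h}} \tag{15}$$

and

$$c'_h = \frac{3 d^{2(h-1)} p^{H+h-1}}{16^{h-1} (h!)^2 K^{H+h-1}}. \tag{16}$$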


The proof for Lemma 2 is by induction on $h$. Starting with the induction basis for $h = 1$, it is easy to verify that, by the Chernoff-Hoeffding inequality,

$$P\left\{\left|\hat{Q}_1(s, a) - Q_1(s, a)\right| \ge \frac{d}{2} \,\middle|\, n_1(s, a) = t\right\} \le 2 e^{-\frac{d^2 t}{2}}, \tag{17}$$

that is, the assertion is satisfied with $c_1 = 1$ and $c'_1 = \frac{d^2}{2}$. Now, assuming the claim holds for $h \ge 1$, below we outline the proof for $h + 1$, relegating the actual proof in full detail to Appendix A.
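
As a sanity check on the induction basis, the sketch below compares a simulated deviation frequency with a two-sided Hoeffding bound of the form $2e^{-d^2 t / 2}$ for the mean of $t$ independent rewards in $[0, 1]$. Bernoulli rewards and all parameter values are assumptions made for illustration; the function names are hypothetical.

```python
import math
import random


def hoeffding_bound(d, t):
    """Two-sided Hoeffding bound on P(|mean - mu| >= d/2) for t samples in [0, 1]."""
    return 2.0 * math.exp(-d * d * t / 2.0)


def empirical_deviation_rate(mean, d, t, trials, rng):
    """Fraction of trials in which the average of t Bernoulli(mean) rewards
    deviates from `mean` by at least d/2."""
    hits = 0
    for _ in range(trials):
        avg = sum(1.0 if rng.random() < mean else 0.0 for _ in range(t)) / t
        if abs(avg - mean) >= d / 2.0:
            hits += 1
    return hits / trials
```

The simulated frequency should stay below the bound, typically by a wide margin, since Hoeffding's inequality is distribution-free and therefore conservative.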

In the proof for $h > 1$, it is crucial to note the invalidity of applying the Chernoff-Hoeffding bound directly, as was done in Eq. 17. There are two reasons for this.

(F1) For $h = 1$, $\hat{Q}$ is an unbiased estimator of $Q$, that is, $E \hat{Q} = Q$. In contrast, the estimates inside the tree (at nodes with $h > 1$) are biased. This bias stems from $\hat{Q}$ possibly being based on numerous sub-optimal choices in the sub-tree rooted in $(s, h)$.

(F2) For $h = 1$, the summands accumulated by $\hat{Q}$ are independent. This is not so for $h > 1$, where the accumulated reward depends on the selection of actions in subsequent nodes, which in turn depends on previous rewards.

However, we show that these deficiencies of $h > 1$ can still be overcome through a novel modification of the seminal Hoeffding-Azuma inequality.

2. Some of this gap can probably be eliminated by more accurate bounding in the numerous bounding steps towards the proof of Theorem 1. However, all such improvements we tried made the already lengthy proof of Theorem 1 even more involved.


Lemma 3 (Modified Hoeffding-Azuma inequality) Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of random variables with support $[0, h]$ and $\mu_i = E X_i$. If $\lim_{i \to \infty} \mu_i = \mu$, and

$$P\{E[X_i \mid X_1, \dots, X_{i-1}] \ne \mu\} \le c_p e^{-c_e i},$$
for  some  0  <  cp  and  0  <  ce  <  1,  then,  for  all  0  <  5  <  4 ,  it  holds  that 


.  i=  1 
t 


.  i= 1 


1 

2h2  ' 

[  +c-pA 

1 

2  h2' 

5 

1  +  CpS2cl 

3 62cei 

e  2  h'2  , 


_3 Sfce. 

e  2h2  . 


(18) 


(19) 

(20) 


Together with Lemma 4 below, the inequalities provided by Lemma 3 allow us to prove
the induction hypothesis in the proof of the central Lemma 2. Note that the specific bound
in Lemma 3 is selected so as to maximize the exponent coefficient. For any $0 < \beta < 1$, the
probabilities of interest in Eqs. 19-20 can also be bounded by


1  + 


cp 

Ce(l~P)e 


CeP-fi) 


_36fceP, 

e  2/,2 


for  further  details,  we  refer  the  reader  to  Discussion  14  in  Appendix  A. 


Definition 1 Let $\langle S, A, Tr, R \rangle$ be an MDP with rewards in $[0, 1]$, planned for initial state $s_0 \in S$ and finite horizon $H$. Let $s$ be a state reachable from $s_0$ with $h$ steps still to go, let $a$ be an action applicable in $s$, and let $\pi^B_t$ be a policy induced by running BRUE on $s_0$ until exactly $t > 0$ samples have finished their exploration phase with applying action $a$ at $s$ with $h - 1$ steps still to go. Given that,

• $X_{t,h}(s, a)$ is a random variable, corresponding to the reward obtained by taking $a$ at $s$, and then following $\pi^B_t$ for the remaining $h - 1$ steps.

• $E_{t,h}(s, a)$ is the event in which $X_{t,h}(s, a)$ is sampled along the optimal actions at each of the $h - 1$ choice points delegated to $\pi^B_t$.

• $\delta_{t,h}(s, a) = Q_h(s, a) - E[X_{t,h}(s, a)]$.

Lemma 4 Let $\langle S, A, Tr, R \rangle$ be an MDP with rewards in $[0, 1]$, planned for initial state $s_0 \in S$ and finite horizon $H$. Let $s$ be a state reachable from $s_0$ with $h + 1$ steps still to go, and $a$ be an action applicable in $s$. Considering $E_{t,h+1}(s, a)$ and $\delta_{t,h+1}(s, a)$ as in Definition 1, for any $t > 0$, if Lemma 2 holds for horizon $h$, then

$$P\{\neg E_{t,h+1}(s, a)\} \le 2Kh(2 + c_h) \, e^{-\frac{p c'_h}{8K} t}, \tag{21}$$

$$\delta_{t,h+1}(s, a) \le 2Kh^2(2 + c_h) \, e^{-\frac{p c'_h}{8K} t}. \tag{22}$$

Together with the modified version of the Hoeffding-Azuma bound in Lemma 3, the bounds established in Lemma 4 allow us to derive concentration bounds for $\hat{Q}_{h+1}$ around $Q_{h+1}$ as in Lemma 5 below, which serves as the key building block for proving the induction hypothesis in the proof of Lemma 2.


Lemma 5 Let BRUE be called on a state $s_0$ of an MDP $\langle S, A, Tr, R \rangle$ with rewards in $[0, 1]$ and finite horizon $H$. For each state $s$ reachable from $s_0$ with $h + 1$ steps still to go, each action $a$ applicable in $s$, and any $t > 0$, it holds that


Qh+i  [s,a)  Qh- )-i  (s,  cl) 


d 

>  - 
~  2 


nh+i  (s,  a)  =  t  >  < 


^3456 


K\hFlfch 


d2p2 


J2 


Ppc'h 
16(/i  +  1)2X 


(23) 


5. Learning With Forgetting and BRUE($\alpha$)

When  we  consider  the  evolution  of  action  value  estimates  in  BRUE  over  time  (as  well 
as  in  all  other  Monte-Carlo  algorithms  for  online  MDP  planning),  we  can  see  that,  in 
internal  nodes  these  estimates  are  based  on  biased  samples  that  stem  from  the  selection 
of  non-optimal  actions  at  descendant  nodes.  This  bias  tends  to  shrink  as  more  samples 
are  accumulated  down  the  tree.  Consequently,  the  estimates  become  more  accurate,  the 
probability  of  selecting  an  optimal  action  increases  accordingly,  and  the  bias  of  ancestor 
nodes  shrinks  in  turn.  An  interesting  question  in  this  context  is:  shouldn’t  we  weigh 
differently  samples  obtained  at  different  stages  of  the  sampling  process?  Intuition  tells 
us  that  biased  samples  still  provide  us  with  valuable  information,  especially  when  they 
are  all  we  have,  but  the  value  of  this  information  decreases  as  we  obtain  more  and  more 
accurate  samples.  Hence,  in  principle,  putting  more  weight  on  samples  with  smaller  bias 
could  increase  the  accuracy  of  our  estimates.  The  key  question,  of  course,  is  which  of  all 
possible  weighting  schemes  are  both  reasonable  to  employ  and  preserve  the  exponential-rate 
reduction  of  expected  simple  regret. 

Here we describe BRUE($\alpha$), an algorithm that generalizes BRUE = BRUE(1) by basing the estimates only on the $\alpha$ fraction of most recent samples. We discuss the value of this addition both from the perspective of the formal guarantees and from the perspective of empirical prospects. BRUE($\alpha$) differs from BRUE in two points:

• In addition to the variables $n(s, a)$ and $\hat{Q}(s, a)$, each node/action pair $(s, a)$ in BRUE($\alpha$) is associated with a list $C(s, a)$ of rewards, collected at each of the $n(s, a)$ samples that are responsible for the current estimate $\hat{Q}(s, a)$.

• When a sample $\rho = \langle s_0, a_1, s_1, \dots, a_k, s_k \rangle$ is issued at iteration $n$, and update-statistics updates the variables at $x = (s_{\sigma(n)-1}, a_{\sigma(n)})$, that update is done not according to Eq. 2 as in BRUE, but according to:

$$\begin{aligned}
n(x) &\leftarrow n(x) + 1, \\
C(x)[n(x)] &\leftarrow \sum_{i=\sigma(n)-1}^{k-1} R(s_i, a_{i+1}, s_{i+1}), \\
\hat{Q}(x) &\leftarrow \frac{1}{\lceil \alpha \cdot n(x) \rceil} \sum_{i=n(x) - \lceil \alpha \cdot n(x) \rceil}^{n(x)} C(x)[i].
\end{aligned} \tag{24}$$
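
The forgetting update can be sketched as a small helper that keeps the reward list of one node/action pair and averages only its most recent part. The class name `ForgettingEstimate` is hypothetical, and the window is read here as exactly the $\lceil \alpha \cdot n \rceil$ most recent rewards, which is an assumption about the intended summation range.

```python
import math


class ForgettingEstimate:
    """Value estimate built from the most recent alpha-fraction of rewards."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must lie in (0, 1]")
        self.alpha = alpha
        self.rewards = []  # reward list of the pair; its length is the counter

    def update(self, sample_reward):
        """Record one accumulated sample reward and return the new estimate."""
        self.rewards.append(sample_reward)
        n = len(self.rewards)
        window = math.ceil(self.alpha * n)  # number of recent rewards kept
        return sum(self.rewards[-window:]) / window
```

With `alpha = 1.0` the helper reduces to the plain average over all samples, which corresponds to BRUE = BRUE(1). Smaller values discard the oldest, and presumably most biased, samples.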


Theorem 6 Let BRUE($\alpha$) be called on a state $s_0$ of an MDP $\langle S, A, Tr, R \rangle$ with rewards in $[0, 1]$ and finite horizon $H$. There exist pairs of parameters $c, c' > 0$, dependent only on $\{\alpha, p, d, K, H\}$, such that, after $n > H$ iterations of BRUE($\alpha$), we have simple regret bounded as

$$E \Delta_H[s_0, \pi^B_n(s_0, H)] \le H c \cdot e^{-c' n}, \tag{25}$$

and choice-error probability bounded as

$$P\{\pi^B_n(s_0, H) \ne \pi^*(s_0, H)\} \le c \cdot e^{-c' n}. \tag{26}$$

The proof for Theorem 6 follows from Lemma 7 below similarly to the way Theorem 1 follows from Lemma 2. Note that in Theorem 6 we do not provide explicit expressions for the constants $c$ and $c'$ as we did in Theorem 1 (for $\alpha = 1$). This is because the expressions that can be extracted from the recursive formulas in this case do not bring much insight. However, we discuss the potential benefits of choosing $\alpha < 1$ in the context of our proof of Theorem 6.

Lemma 7 Let BRUE($\alpha$) be called on a state $s_0$ of an MDP $\langle S, A, Tr, R \rangle$ with rewards in $[0, 1]$ and finite horizon $H$. For each $h \in [\![H]\!]$, there exist parameters $c_h, c'_h > 0$, dependent only on $\{\alpha, p, d, K, H\}$, such that, for each state $s$ reachable from $s_0$ in $H - h$ steps and any $t > 0$, it holds that

$$\begin{aligned}
P\left\{\hat{Q}_h(s, a) - Q_h(s, a) \ge \frac{d}{2} \,\middle|\, n_h(s, a) = t\right\} &\le c_h e^{-c'_h t}, \\
P\left\{\hat{Q}_h(s, a) - Q_h(s, a) \le -\frac{d}{2} \,\middle|\, n_h(s, a) = t\right\} &\le c_h e^{-c'_h t}.
\end{aligned} \tag{27}$$

The proof for Lemma 7 is by induction, following the same line as the proof for Lemma 2. In fact, it deviates from the latter only in the application of the modified Hoeffding-Azuma inequality, which has to be further modified to capture the partial sums as in BRUE($\alpha$).

Lemma 8 (Modified Hoeffding-Azuma inequality for partial sums) Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of random variables with support $[0, h]$ and $\mu_i = E X_i$. If $\lim_{i \to \infty} \mu_i = \mu$, and

$$P\{E[X_i \mid X_1, \dots, X_{i-1}] \ne \mu\} \le c_p e^{-c_e i}, \tag{28}$$

for some $0 < c_p$ and $0 < c_e < 1$, then, for all 0 < $\delta$ < | , it holds that


^  Xt  >  +  t5  f  < 

^  i=t—  \ai\ 
t 

yy  Xj  <  /xt — 1<5  ^  < 

i=t—  \at\ 


1  +  — - ^e-Ce(l-a)2t 

Ce(l  -  a) 


1  +  — - ^e-Ce(l-a)2t 

Ce(l  -  a) 


_35^Cc 

e  2ha 


3Szce 

e  2h'3 


at 


at 


(29) 

(30) 


Considering  the  benefits  of  “sample  forgetting”  as  in  BRUE(a),  let  us  compare  the  bound 
in  Lemma  8  to  the  bound 


3 Sfcei 
2h2 


(1-/3) 


ce(l~<3) 
2 h? 


provided by Lemma 3 for BRUE, that is, when all accumulated samples are averaged. While both bounds are very similar, the exponent of the second exponential term is multiplied for BRUE($\alpha < 1$) by $(1 - \alpha)t$. This poses a tradeoff: decreasing $\alpha$ reduces the sampling bias, and thus decreases the exponential term that carries a leading constant, but increases the other exponential term with no leading constant. Obviously, since there is no bias at leaf nodes, it makes no sense to set $\alpha < 1$ there. However, as we go further up the tree, the bias tends to grow ($c_p \gg 1$), but we also expect to have more samples ($t$ is larger). Thus, from the perspective of formal guarantees, it seems appealing to choose smaller values of $\alpha$. Nevertheless, we do not try to optimize the value of $\alpha$ here. First, optimizing bounds does not necessarily lead to optimized empirical accuracy. Second, the underlying optimization would have to be specific to each horizon $h$ and each sample size $t$ (which is obviously out of the question), and thus we would anyway have to consider only some rough approximations to this optimization problem. Finally, biased samples in practice might be more valuable than what the theory suggests, as long as all actions at the same state/steps-to-go decision point experience a similar bias.


6.  Experimental  Evaluation 

We have evaluated BRUE empirically on the MDP sailing domain (Peret & Garcia, 2004) that was used in previous works for evaluating MC planning algorithms (Peret & Garcia, 2004; Kocsis & Szepesvari, 2006; Tolpin & Shimony, 2012), as well as on random game trees used in the original empirical evaluation of UCT (Kocsis & Szepesvari, 2006).

In the sailing domain, a sailboat navigates to a destination on an 8-connected grid representing a marine environment, under fluctuating wind conditions. The goal is to reach the destination as quickly as possible, by choosing at each grid location a neighbor location to move to. The duration of each such move depends on the direction of the move (ceteris paribus, diagonal moves take $\sqrt{2}$ times as long as straight moves), the direction of the wind relative to the sailing direction (the sailboat cannot sail against the wind and moves fastest with a tail wind), and the tack. The direction of the wind changes over time, but its strength is assumed to be fixed. This sailing problem can be formulated as a goal-driven MDP over a finite state space and a finite set of actions, with each state capturing the position of the sailboat, wind direction, and tack.

Figure 4: Empirical performance of BRUE, BRUE(0.9), UCT, and $\epsilon$-greedy + UCT (denoted as GCT, for short) in terms of the average error on sailing domain problems on $n \times n$ grids with $n \in \{5, 10, 20, 40\}$.


In a goal-driven MDP, the lengths of the paths to a terminal state are not necessarily bounded, and thus it is not entirely clear to what depth BRUE shall construct its tree. In the sailing domain, we chose $H$ to be $4 \times n$, where $n$ is the grid size of the problem instance, as it is unlikely that the optimal path between any two locations on the grid will be longer than a complete encircling of the considered area. We note, however, that the recommendation-oriented samples $\rho$ always end at a terminal state, similar to the rollouts issued by UCT and $\epsilon$-greedy + UCT.

Snapshots of the results for different grid sizes are shown in Figure 4. We compared BRUE with two MCTS-based algorithms: the UCT algorithm, and a recent modification of UCT, $\epsilon$-greedy + UCT, obtained from the former by replacing the UCB1 policy at the root node with the $\epsilon$-greedy policy (Tolpin & Shimony, 2012). The motivation behind the design of $\epsilon$-greedy + UCT was to improve the empirical simple regret of UCT, and the results for $\epsilon$-greedy + UCT reported by Tolpin and Shimony (2012) (and confirmed by our experiments here) are very impressive. We also show the results for BRUEper(0.9), a slight modification of BRUE(0.9) with a more permissive update scheme: instead of updating only the state-action node at the level of the switching point, we also update any ancestor for which either not all applicable actions have been sampled or the chosen action was identical to the best empirical one.

Figure 5: Empirical performance of BRUE, UCT, and $\epsilon$-greedy + UCT (denoted as GCT) in terms of the average error on the random game trees with branching factor B and tree depth D. Panels: B = 6 / D = 6 and B = 2 / D = 16.

All four algorithms were implemented within a single software infrastructure. As suggested
by more recent works on UCT, the exploration coefficient for UCT and ε-greedy + UCT
(parameter c in Eq. 1) was set to the empirical best value of an action at the decision
point (Keller & Eyerich, 2012b). (This setting of the exploration coefficient resulted in better
performance of both UCT and ε-greedy + UCT than with the settings reported on the
sailing domain in the respective original publications.) The ε parameter in ε-greedy + UCT
was set to 0.5 as in the experiments of Tolpin and Shimony (2012). Each algorithm was run
on 1000 randomly chosen initial states $s_0$, and the performance of the algorithm was assessed
in terms of the average error $Q(s_0, a) - V(s_0)$, that is, the difference between the
true value of the action a chosen by the algorithm and that of the optimal action $\pi^*(s_0)$.
Consistent with the results reported by Tolpin and Shimony (2012), on the smaller tasks
ε-greedy + UCT outperformed UCT by a very large margin, with the latter exhibiting very
little improvement over time even on the smallest, 5 × 5, grids. The difference between
ε-greedy + UCT and UCT on the larger tasks was less notable. In turn, BRUE substantially
outperformed ε-greedy + UCT, with the improvement being consistent except for relatively
short planning deadlines, and BRUEper(0.9) performed even better than BRUE.
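
To make the difference between the two root policies concrete, the following is a minimal sketch of the textbook UCB1 rule and a textbook ε-greedy rule. It is an illustration only: the function names are hypothetical, the exact form of Eq. 1 is not reproduced here, and the ε-greedy variant used by Tolpin and Shimony may differ in how it spreads the exploration probability.

```python
import math
import random


def ucb1_choice(q, counts, c):
    """Pick the action maximizing q[a] + c * sqrt(ln(N) / counts[a])."""
    # Untried actions are sampled first, as in standard UCB1.
    for a, n_a in enumerate(counts):
        if n_a == 0:
            return a
    total = sum(counts)
    scores = [q[a] + c * math.sqrt(math.log(total) / counts[a])
              for a in range(len(q))]
    return max(range(len(q)), key=scores.__getitem__)


def epsilon_greedy_choice(q, epsilon, rng):
    """With probability 1 - epsilon pick the empirically best action,
    otherwise pick uniformly at random among all actions (assumed variant)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=q.__getitem__)
```

With ε = 0.5, as in the experiments above, this rule keeps sampling every root action at a constant rate, whereas UCB1 concentrates its samples on the actions that currently look best.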

The above allows us to conclude that BRUE is not only attractive in terms of the
formal performance guarantees, but can also be very effective in practice for online planning.
Likewise, the “learning with forgetting” extension BRUE(α) also has its practical merits.
Under the same parameter setting of UCT and ε-greedy + UCT, we have also evaluated the
three algorithms in a domain of random game trees whose goal is a simple modeling of
two-person zero-sum games such as Go, Amazons, and Clobber. In such games, the winner
is decided by a global evaluation of the end board, with the evaluation employing one or
another feature-counting procedure; the rewards are thus associated only with the terminal
states. The rewards are calculated by first assigning values to moves, and then summing up
these values along the paths to the terminal states. Note that the move values are used for
the tree construction only and are not made available to the players. The values are chosen
uniformly from [0, 127] for the moves of MAX, and from [−127, 0] for the moves of MIN.
The players act so as to (depending on the role) maximize/minimize their individual payoff:
the aim of MAX is to reach a terminal state s with as high R(s) as possible, and the objective
of MIN is similar, mutatis mutandis. This simple game tree model is similar in spirit to many
other game tree models used in previous work (Kocsis & Szepesvari, 2006; Smith & Nau,
1994), except that the success/failure of the players is measured not on a ternary scale of
win/lose/draw, but via the actual payoffs they receive. We ran experiments with two
different settings of the branching factor (B) and tree depth (D). As in the sailing domain,
we compared the convergence rate obtained by BRUE, UCT, and ε-greedy + UCT. Figure 5
plots the average error rate for two configurations, B = 6, D = 6 and B = 2, D = 16, with
the average in each setting obtained over 500 trees. The results here appear encouraging as
well, with BRUE overtaking the other two algorithms more quickly on the deeper trees.
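
The random game tree construction described above can be sketched as follows. This is a hedged illustration, not the generator used in the experiments: the nested-list representation, the function names, and the use of integer move values are assumptions.

```python
import random


def build_tree(branching, depth, rng, max_to_move=True, path_sum=0):
    """Return a nested list of children; leaves are terminal rewards."""
    if depth == 0:
        return path_sum  # reward = sum of the move values along the path
    children = []
    for _ in range(branching):
        # MAX moves are worth [0, 127], MIN moves are worth [-127, 0].
        value = rng.randint(0, 127) if max_to_move else rng.randint(-127, 0)
        children.append(build_tree(branching, depth - 1, rng,
                                   not max_to_move, path_sum + value))
    return children


def minimax(node, max_to_move=True):
    """Exact game value of a tree produced by build_tree."""
    if not isinstance(node, list):
        return node
    values = [minimax(child, not max_to_move) for child in node]
    return max(values) if max_to_move else min(values)


def count_leaves(node):
    """Number of terminal states below node."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)
```

The exact minimax value is what an error measure for the root recommendation would be computed against; the move values themselves stay hidden from the players and only shape the terminal rewards.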

7.  SUMMARY 

We  have  introduced  BRUE,  a  simple  Monte-Carlo  algorithm  for  online  planning  in  MDPs 
that  guarantees  exponential-rate  reduction  of  the  performance  measures  of  interest,  namely 
the  simple  regret  and  the  probability  of  erroneous  action  choice.  This  improves  over  previous 
algorithms  such  as  UCT,  which  guarantee  only  polynomial-rate  reduction  of  these  measures. 
The  algorithm  has  been  formalized  for  finite  horizon  MDPs,  and  it  was  analyzed  as  such. 
However,  our  empirical  evaluation  shows  that  it  also  performs  well  on  goal-driven  MDPs 
and  two-person  games. 

A few questions remain for future work. In the setting of γ-discounted MDPs with
infinite horizons, a straightforward way to employ BRUE is to fix a horizon H, use the
algorithm as is, and derive guarantees on the aforementioned measures of interest by simply
accounting for the additive gap of $\gamma^H R_{\max}/(1-\gamma)$ between the state/action values
under horizon H and those under an infinite horizon. However, this is not necessarily the
best way to plan online for infinite-horizon MDPs, and thus this setting requires further
inspection. Second, it is not unlikely that the state-space independent factors $c_h$ and $c'_h$
in the guarantees of BRUE can be improved by employing more sophisticated combinations
of exploration and estimation samples. Another important point to consider is the speed
of convergence to the optimal action, as opposed to the speed of convergence to “good”
actions. BRUE is geared towards identifying the optimal action, although in many large
MDPs, “good” is often the best one can hope for. To identify the optimal solution, BRUE
devotes samples equally to all depths. However, focusing on nodes closer to the root node
may improve the quality of the recommendation if the planning time is severely limited.
Finally, the core tree sampling scheme employed by BRUE differs from the more standard
scheme employed in previous work. While this difference plays a critical role in establishing
the formal guarantees of BRUE, it is still unclear whether that difference is necessary for
establishing exponential-over-time reduction of the performance measures.
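
As a small worked illustration of the horizon-truncation remark for γ-discounted MDPs: cutting a discounted return with rewards in [0, R_max] after H steps loses at most γ^H R_max/(1 − γ), the tail of the geometric series. The sketch below (function names are hypothetical) computes that gap and the smallest horizon that keeps it under a tolerance; it assumes 0 < γ < 1 and a positive tolerance.

```python
def truncation_gap(gamma, r_max, horizon):
    """Upper bound gamma^H * R_max / (1 - gamma) on the value lost by cutting
    a gamma-discounted return with rewards in [0, r_max] after horizon steps."""
    return gamma ** horizon * r_max / (1.0 - gamma)


def smallest_horizon(gamma, r_max, tolerance):
    """Smallest H whose truncation gap is at most tolerance."""
    horizon = 0
    while truncation_gap(gamma, r_max, horizon) > tolerance:
        horizon += 1
    return horizon


print(smallest_horizon(0.9, 1.0, 0.1))  # 44
```

The required horizon grows only logarithmically in 1/tolerance, but it grows quickly as γ approaches 1, which is one reason a fixed-horizon reduction may not be the best way to plan online in the discounted setting.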
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Appendices 

Appendix  A.  Proof  of  Theorem  1 

The  proof  of  Theorem  1  relies  on  the  inductive  assumption  with  respect  to  the  correctness 
of  Lemma  2,  as  well  as  on  several  auxiliary  claims  that  we  prove  in  what  follows.  The 
dependence  diagram  below  depicts  the  overall  flow  of  the  proof,  with  the  more  central 
claims  being  depicted  with  rectangular  nodes. 


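Proposition 9 (Concentration inequality for negative-binomial distributions) Let
NB(t, p) be a random variable with negative-binomial distribution. Then

\[
\mathbb{P}\left\{ NB(t,p) \le \frac{3t}{4p} \right\} \le e^{-\frac{tp}{6}}. \tag{31}
\]

Proof: It is well known that the event in which the number of Bernoulli trials required to
obtain the t-th success is at most some positive integer b is equivalent to the event that
the number of successes in b Bernoulli trials is at least t. Therefore, for any 0 < δ < 1,

\[
\begin{aligned}
\mathbb{P}\left\{ NB(t,p) \le \frac{\delta t}{p} \right\}
 &= \mathbb{P}\left\{ Bin\left(\tfrac{\delta t}{p}, p\right) \ge t \right\} \\
 &= \mathbb{P}\left\{ Bin\left(\tfrac{\delta t}{p}, p\right) \ge t\delta + (t - t\delta) \right\} \\
 &\le e^{-\frac{2 (t - t\delta)^2 p}{\delta t}} \quad \text{by the Hoeffding inequality,}
\end{aligned} \tag{32}
\]

and choosing δ = 3/4 yields the result.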
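
The bound that the Hoeffding step with δ = 3/4 yields, P{NB(t, p) ≤ 3t/(4p)} ≤ e^{−tp/6}, can be checked numerically. The sketch below is an illustration under stated assumptions: NB(t, p) is read as the number of Bernoulli(p) trials needed for the t-th success (as in the proof), the threshold is rounded down to an integer, and the function names are hypothetical.

```python
import math


def binom_tail_at_least(n, p, k):
    """P{Bin(n, p) >= k}, computed exactly from the binomial pmf."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


def nb_cdf(t, p, b):
    """P{NB(t, p) <= b}: the t-th success arrives within the first b trials."""
    return binom_tail_at_least(b, p, t) if b >= t else 0.0


def check_bound(t, p):
    """Return (exact probability, bound e^{-tp/6}) for the threshold 3t/(4p)."""
    b = math.floor(3 * t / (4 * p))
    return nb_cdf(t, p, b), math.exp(-t * p / 6)
```

For example, `check_bound(20, 0.5)` compares the exact probability of seeing 20 successes within 30 fair coin flips against e^{−10/6}.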


Proposition 10 (Number of Child Samples Bound) Let BRUE be called on a state
$s_0$ of an MDP ⟨S, A, Tr, R⟩ with rewards in [0, 1] and finite horizon H. Let (s, h) be a node
reachable from ($s_0$, H), and in turn, (s', h') be a node reachable from (s, h) via an action
sequence that starts with applying action a at s. Then, for any a' ∈ A(s'), we have

\[
\mathbb{P}\left\{ n_{h'}(s',a') \le \frac{p_{h'}(s',a')\, t}{4 p_h(s,a)} \;\middle|\; n_h(s,a) = t \right\}
\le 2 e^{-\frac{t\, p_{h'}(s',a')^2}{6 p_h(s,a)}}, \tag{33}
\]

where $p_h(s,a)$ is the probability that an (H − h)-iteration of BRUE will issue a sample
whose exploration phase ends with applying action a at state s with h steps still to go.

Proof: By the choice of the switching point function of BRUE as in Eq. 9, the number
of samples of action a' in the descendant node (s', h') between two consecutive samples of
action a in node (s, h) is distributed according to

\[
\sum_{i=1}^{1+\gamma} \beta_i, \tag{34}
\]

where $\gamma \sim Geo(p_h(s,a))$ and $\beta_i \sim Ber(p_{h'}(s',a'))$ are all independent random variables.
Indeed, for every pair of consecutive iterations n < n' with σ(n) = σ(n') = H − h,

(i) there is exactly one iteration n < n'' < n' with σ(n'') = H − h', and

(ii) the number of (H − h)-iterations between two consecutive (H − h)-iterations that finish
their exploration phase with applying action a at s is geometric.

Putting (i) and (ii) together, the number of (H − h')-iterations between a pair of consecutive
(H − h)-iterations that finish their exploration phase with applying action a at s is also
geometric. In turn, the probability that an (H − h')-iteration will finish its exploration
phase with applying a' at s' is $p_{h'}(s',a')$, and thus the number of (H − h')-iterations that
finish their exploration phase with applying a' at s' between a pair of consecutive
(H − h)-iterations that finish their exploration phase with applying action a at s is distributed as
in Eq. 34.

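Similarly, it can be shown that the (conditioned) random variable

\[
n_{h'}(s',a') \mid n_h(s,a) = t
\]

is distributed according to

\[
\sum_{i=1}^{t+\gamma_t} \beta_i,
\]

where $\gamma_t \sim NB(t, p_h(s,a))$, $\beta_i \sim Ber(p_{h'}(s',a'))$, and all $\gamma_t$ and $\beta_i$ are independent.

Therefore, denoting $p_h(s,a)$ and $p_{h'}(s',a')$ by $p_h$ and $p_{h'}$, respectively, for short, we have

\[
\begin{aligned}
&\mathbb{P}\left\{ n_{h'}(s',a') \le \frac{t}{4p_h}\, p_{h'} \;\middle|\; n_h(s,a) = t \right\}
 = \mathbb{P}\left\{ \sum_{i=1}^{t+\gamma_t} \beta_i \le \frac{t}{4p_h}\, p_{h'} \right\} \\
&\quad \le \mathbb{P}\left\{ \gamma_t \le \frac{3t}{4p_h} \right\}
   + \sum_{x=\frac{3t}{4p_h}+1}^{\infty} \mathbb{P}\left\{ \sum_{i=1}^{t+x} \beta_i \le \frac{t}{4p_h}\, p_{h'} \right\} \mathbb{P}\{\gamma_t = x\} \\
&\quad = \mathbb{P}\left\{ \gamma_t \le \frac{3t}{4p_h} \right\}
   + \sum_{x=\frac{3t}{4p_h}+1}^{\infty} \mathbb{P}\left\{ Bin(t+x, p_{h'}) \le \frac{t}{4p_h}\, p_{h'} \right\} \mathbb{P}\{\gamma_t = x\} \\
&\qquad \text{since the } \beta_i \text{ are all independent Bernoulli variables with common parameter } p_{h'} \\
&\quad = \mathbb{P}\left\{ \gamma_t \le \frac{3t}{4p_h} \right\}
   + \sum_{x=\frac{3t}{4p_h}+1}^{\infty} \mathbb{P}\left\{ Bin(t+x, p_{h'}) \le (t+x)\,p_{h'} - \delta_x \right\} \mathbb{P}\{\gamma_t = x\},
\end{aligned} \tag{36}
\]

where $\delta_x = \frac{4xp_h - t(1-4p_h)}{4p_h}\, p_{h'}$.

Given that, for all $x > \frac{3t}{4p_h}$, we have

\[
\begin{aligned}
\mathbb{P}\left\{ Bin(t+x, p_{h'}) \le (t+x)\,p_{h'} - \delta_x \right\}
 &\le e^{-\frac{2\delta_x^2}{t+x}} \\
 &\qquad \text{by the Hoeffding inequality, applicable here since }
   \delta_x = \frac{4xp_h - t(1-4p_h)}{4p_h}\, p_{h'} \ge \frac{t(2+4p_h)}{4p_h}\, p_{h'} > 0 \\
 &\le e^{-\frac{t\,p_{h'}^2}{2p_h}}
   \qquad \text{since } \frac{\delta_x^2}{t+x} \ge \frac{t\,p_{h'}^2}{4p_h}.
\end{aligned} \tag{37}
\]

Plugging Eq. 37 into Eq. 36, we obtain

\[
\begin{aligned}
\mathbb{P}\left\{ n_{h'}(s',a') \le \frac{t}{4p_h}\, p_{h'} \;\middle|\; n_h(s,a) = t \right\}
 &\le \mathbb{P}\left\{ \gamma_t \le \frac{3t}{4p_h} \right\}
   + \sum_{x=\frac{3t}{4p_h}+1}^{\infty} e^{-\frac{t\,p_{h'}^2}{2p_h}}\, \mathbb{P}\{\gamma_t = x\} \\
 &\le e^{-\frac{t p_h}{6}}
   + \sum_{x=\frac{3t}{4p_h}+1}^{\infty} e^{-\frac{t\,p_{h'}^2}{2p_h}}\, \mathbb{P}\{\gamma_t = x\}
   \qquad \text{(Prop. 9)} \\
 &\le e^{-\frac{t p_h}{6}} + e^{-\frac{t\,p_{h'}^2}{2p_h}} \\
 &\le 2 e^{-\frac{t\,p_{h'}^2}{6p_h}}.
\end{aligned} \tag{38}
\]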


Proposition 11 Let BRUE be called on a state $s_0$ of an MDP ⟨S, A, Tr, R⟩ with rewards
in [0, 1] and finite horizon H. Let (s, h + 1) be a node reachable from ($s_0$, H), and in turn,
(s', h') be a node reachable from (s, h + 1). If Lemma 2 holds for horizon h, then, for any
a ∈ A(s), a' ∈ A(s'), and t ≥ 1,

\[
\mathbb{P}\left\{ \hat{Q}_{h'}(s',a') - Q_{h'}(s',a') \ge \frac{d}{2} \;\middle|\; n_{h+1}(s,a) = t \right\}
\le (2 + c_{h'})\, e^{-t c'_{h'} \frac{p^{h+1-h'}}{6K^{h+1-h'}}} \tag{39}
\]

and

\[
\mathbb{P}\left\{ \hat{Q}_{h'}(s',a') - Q_{h'}(s',a') \le -\frac{d}{2} \;\middle|\; n_{h+1}(s,a) = t \right\}
\le (2 + c_{h'})\, e^{-t c'_{h'} \frac{p^{h+1-h'}}{6K^{h+1-h'}}}. \tag{40}
\]

Proof: The proof for the two inequalities is identical, and thus we explicitly prove
here only Eq. 39. In what follows, we use $p_h(s,a)$ and $p_{h'}(s',a')$ as defined in Proposition 10,
and here as well denote them by $p_h$ and $p_{h'}$, respectively, for short. Similarly, by $\hat{Q}_{h'}$, $Q_{h'}$,
$n_{h'}$, and $n_{h+1}$ we refer to $\hat{Q}_{h'}(s',a')$, $Q_{h'}(s',a')$, $n_{h'}(s',a')$, and $n_{h+1}(s,a)$, respectively, for short.
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Qh'  -  Qh<  >  2 


<  P  \nh'  < 


tph' 

tyh+i 


nh+ 1  =  t 


nh+ 1  =  t }  +  P  <J  Qh'  -  Qh'  >  i«/i'  >  7~~~ 

2  4p/j+i 


ra/H-i  =  t 


Prop. 10  Ph' 


<  2e  p'l+1  +  ^  P  <  Qh'  -  Q/,/  >  - 


I.A./Eq.l4  _  <Ph'  , 

<  2e  6pft+i  +  c/,/e  Ti‘'P{n/,-  =  r|n/l+i  =  t} 


r_ 

4p/i+i 

OO 


d 


nh>  =  t  \  P  {nh>  =  r  |  n^+i  =  t} 


(41) 


**bL. 


__  tph' 
4Ph+ i 

‘Ph' 


<  2e  6p'*+1+c/l/e  h,4pM-i. 

Consider  the  fraction  TV_  anf]  recall  that  (s',h')  is  a  descendant  of  (s,h).  The  latter 
implies  that 

(  p  \h+\-h' 

Ph '  >  Ph+i  ) 

v  h-\-l—h' 


and  thus  >  (j^)  1  .  Continuing  now  Eq.  41,  by  Eq.  16, 


°h >  — 


M2(h’-l)pH+h’-l 

32h'-1((h,)\)‘2KH+hl- 


H+h'-l 


< 


P_ \ 

KJ 


H-ti 


<  Ph ' 


and  Eq.  41  under  c'h,  <  pw  implies 


tphi 


—td 


/  _lhL 


2e  6Ph+i+Ch,e  h'iph+i  <{2  +  chf)e  h'6ph+ 1 

/i+l-fc' 


—tc. 


<  (2  +  c/j/)e  ^  skw-'1'  . 


(42) 


Proposition 12 (Expected accumulated rewards) Let ⟨S, A, Tr, R⟩ be an MDP, and
let X be the accumulated reward of a sample

\[
\rho = \langle s, a, s_1, a_1, \dots, s_h, a_h, s_{h+1} \rangle,
\]

started with taking action a ∈ A in state s ∈ S, and continued with additional h steps, in
which actions are chosen according to some arbitrary (possibly randomized) policy π. Let

• $E_{\pi,h+1}(s,a)$ denote the event in which, after a, ρ is sampled along the optimal actions,
that is, for i ∈ [h], $a_i = \pi^*_{h+1-i}(s_i)$, and

• $\delta_{\pi,h+1}(s,a) = Q_{h+1}(s,a) - \mathbb{E}[X]$.

Then,

\[
\mathbb{P}\left\{ \neg E_{\pi,h+1}(s,a) \right\} \le \sum_{i=1}^{h} \mathbb{P}\left\{ \pi_{h+1-i}(s_i) \ne \pi^*_{h+1-i}(s_i) \right\}, \tag{43}
\]

\[
\delta_{\pi,h+1}(s,a) = \sum_{i=1}^{h} \mathbb{E}\left[ \Delta_{h+1-i}\left[ s_i, \pi_{h+1-i}(s_i) \right] \right]. \tag{44}
\]


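Proof: The proof of Eq. 43 is straightforward by the union bound. To prove Eq. 44, we
note that for any state/steps-to-go pair (s, h) ∈ S × [H], we have

\[
\mathbb{E}_{\pi,s'}\left[ R(s, \pi_h(s), s') \right]
 = \mathbb{E}_{\pi}\left[ Q_h(s, \pi_h(s)) \right] - \mathbb{E}_{\pi,s'}\left[ Q_{h-1}(s', \pi^*(s', h-1)) \right].
\]

Using that, we obtain a telescopic series that yields

\[
\begin{aligned}
\mathbb{E}[X]
 &= \mathbb{E}_{\pi,s_1:s_{h+1}}\left[ R(s, a, s_1) + \sum_{i=1}^{h} R\left(s_i, \pi_{h+1-i}(s_i), s_{i+1}\right) \right] \\
 &= Q_{h+1}(s,a) - \mathbb{E}_{s_1}\left[ Q_h(s_1, \pi^*(s_1, h)) \right] \\
 &\quad + \sum_{i=1}^{h} \left( \mathbb{E}_{\pi,s_1:s_i}\left[ Q_{h+1-i}(s_i, \pi_{h+1-i}(s_i)) \right]
   - \mathbb{E}_{\pi,s_1:s_{i+1}}\left[ Q_{h-i}(s_{i+1}, \pi^*(s_{i+1}, h-i)) \right] \right) \\
 &= Q_{h+1}(s,a) - \sum_{i=1}^{h} \mathbb{E}_{\pi,s_1:s_i}\left[ \Delta_{h+1-i}\left[ s_i, \pi_{h+1-i}(s_i) \right] \right].
\end{aligned}
\]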

Proof  of  Lemma  4: 

By Definition 1, the event $E_{t,h+1}(s,a)$ corresponds to a sample

\[
\rho = \langle s, a, s_1, a_1, \dots, s_h, a_h, s_{h+1} \rangle,
\]

obtained by taking action a at state s, reachable from $s_0$ with h + 1 steps still to go, and
then following the policy $\pi^t$, induced by running BRUE on $s_0$ until exactly t > 0 samples
finish their exploration phase with applying action a at s with h steps still to go. From
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Proposition  12,  denoting  *  =  h  +  1  —  i,  we  have 
h 

P  rfEt)h+ 1  (s,  a)}  <  ^  P  {7 Tj  (Si)  /  7T?  (Si)  I  nh+i  (s,  a)  =  t} 

2=1 

h 

<  Y.  Y  p{Q?  {si,a!)  >  Qi(si,rf  (s^)  nh+i(s,a)  =  t} 
*= 1 
h 


< 


r  r  d 

X]  P  \  Qi  (su  a')  -  Qi  (su  rf  >  ~ 

i=l  a'^nf(si) 


nh+i  (s,a)  =  t.\  + 


Qi  (surf  ( Si ))  -  Qi  (si,  rf  ( Si ))  <  -- 


«/H-i  (s,  a)  =  t 


Prop.  11 


up. XX  ^ - > 

<  y  2 K  (2  + 


p  - 

c,- )  e  1 6x 1 


2=1 


(*) 

<  2 ICh  (2  +  C/J  e~tchei?. 


(45) 


The  last  inequality  (*)  in  Eq.  45  holds  because,  by  assuming  Lemma  2  for  horizon  h, 
for  i  G  [[/;] ,  it  can  be  straightforwardly  derived  from  Eqs.  15  and  16  that  c*  >  c,;_  1  and 

Similarly, 


^t,h+l  (®>  rf  —  E  EAi[si,7rf  (si)] 


2=1 

h 

sE 

2=1 

h 

sE 

2=1 


i  ^  P  |q?:  (sj,  a)  >  Qi  (s*,  7if  (s*))  n^+i  (s,  a)  =  fj 

L  a'^TrKsi) 

h  ^  P{<2i  (si,a')  >  Qi  (si,rf  (si))  nh+i  (s,  rf  =  f  j 


Prop.  11  ^ ^  1  pz 

<  2Kh  (2  +  a )  e  *  1  e m 
2=1 

<  2 I<h2  (2  +  erf  e~tchek . 


(46) 


Fact 13 Let Z be a random variable with support [a, b] and E[Z] = 0. Then, for any
λ ∈ ℝ+,

\[
\mathbb{E}\left[ \exp(\lambda Z) \right] \le \exp\left( \frac{\lambda^2 (b-a)^2}{8} \right).
\]

This result is well known due to Hoeffding.
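
Fact 13 is commonly known as Hoeffding's lemma, E[exp(λZ)] ≤ exp(λ²(b − a)²/8) for a zero-mean Z supported on [a, b]. As a quick sanity check, the sketch below evaluates both sides for the two-point zero-mean distribution on {a, b}; the function names are hypothetical and the check assumes a < 0 < b.

```python
import math


def two_point_mgf(a, b, lam):
    """E[exp(lam * Z)] for the zero-mean Z with
    P{Z = a} = b / (b - a) and P{Z = b} = -a / (b - a)."""
    p_a = b / (b - a)
    p_b = -a / (b - a)
    return p_a * math.exp(lam * a) + p_b * math.exp(lam * b)


def hoeffding_lemma_bound(a, b, lam):
    """Right-hand side exp(lam^2 * (b - a)^2 / 8) of Hoeffding's lemma."""
    return math.exp(lam ** 2 * (b - a) ** 2 / 8.0)
```

For a = −1 and b = 1 the left-hand side is cosh(λ) and the right-hand side is exp(λ²/2), which recovers the familiar inequality cosh(λ) ≤ exp(λ²/2).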
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Proof  of  Lemma  3  (Modified  Hoeffding-Azuma  inequality): 

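Let $E_t$ be the event that $\mathbb{E}[X_t \mid X_1, \dots, X_{t-1}] = \mu$, and let

\[
Y_t = X_t - \mu \mid X_1(\omega), \dots, X_{t-1}(\omega).
\]

The random variable $Y_t$ is bounded by $h - \mu \ge Y_t \ge -\mu$, and furthermore, for $\omega \in E_t$,
$\mathbb{E}\,Y_t = 0$. Therefore, using Fact 13, for all $\omega \in E_t$ and λ ∈ ℝ+, it holds that

\[
\mathbb{E}\left[ e^{\lambda(\mu - X_t)} \mid X_1, \dots, X_{t-1} \right] \le e^{\frac{\lambda^2 h^2}{8}}. \tag{47}
\]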


Moreover, 

E 


gAEA=l(M  Xi) 


=  E Et 
=  E Et 

=  E 

A 

<  e 


oAE^iit1  ^i) 


+  E  ^Et 


AEU(m-^) 


E 


ae!=i  b-Xi) 


Xi 


+  E-,et 


=AEl=iMi) 


DA  (m-M) 


‘E 


AEi=i  (m  aq) 


+  P  {->£*}  eAt/i 


+  E- 


■£t 


AEU(mA) 


Eq.18  A2h2 


Te^E  [eA£fciV-*0]  +Cpe*(A/l-ce) 

(*)  \2^2  ^  ,2t2 

T—  1 


by  the  auxiliary  step  in  Eqs.  49-51  below 

0r(Xh-ce) 


(48) 


<  e" 


i + cp  i 


s  e 


T—  1 


Considering the recursion

\[
f(t) = \theta f(t-1) + g(t), \tag{49}
\]

it is easy to verify that, for all 0 ≤ c ≤ t,

\[
f(t) = \theta^{t-c} f(c) + \sum_{\tau=c+1}^{t} \theta^{t-\tau} g(\tau). \tag{50}
\]
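
The unrolled form of the linear recursion f(t) = θ f(t − 1) + g(t), namely f(t) = θ^{t−c} f(c) + Σ_{τ=c+1}^{t} θ^{t−τ} g(τ), is easy to confirm numerically. The sketch below compares the closed form with a step-by-step evaluation; the function names and the sample g are hypothetical.

```python
def unroll(theta, g, f_c, c, t):
    """Closed form: theta^(t-c) * f(c) + sum_{tau=c+1}^{t} theta^(t-tau) * g(tau)."""
    return theta ** (t - c) * f_c + sum(theta ** (t - tau) * g(tau)
                                        for tau in range(c + 1, t + 1))


def iterate(theta, g, f_c, c, t):
    """Apply f(k) = theta * f(k-1) + g(k) step by step from k = c+1 up to t."""
    value = f_c
    for k in range(c + 1, t + 1):
        value = theta * value + g(k)
    return value
```

Both functions return f(c) itself when t = c, which is the base case of the induction that verifies the closed form.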


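Given that, the bound (*) in Eq. 48 is obtained by setting

\[
\theta = e^{\frac{\lambda^2 h^2}{8}}, \qquad
f(t) = \mathbb{E}\left[ e^{\lambda \sum_{i=1}^{t} (\mu - X_i)} \right], \qquad
g(t) = c_p\, e^{t(\lambda h - c_e)}. \tag{51}
\]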
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Now,  by  Markov  inequality,  for  any  A  >  0, 


fit  —  ^2  Xi  >  tS 


2=1 


<  e~xtSK 


\,c  A zh2t 

<  e~xt°e~~s~ 


i  +  cpy~] 


e  s  e 


r(A  h—Ce) 


2 62ce.  S 


2C2 

-5ft 

=  e  ”e  2/i2 
2<5ce 


r=l 

°0  2  2 

—  6  ce  T  T(2Sce  r  \ 

1  +  cp  ^  e  e  v  &  e' 

T=1 


by  setting  A  = 


h2 


3  <Vce 


<  e  2/j2 


since  5  <  — 

_3 tPce, 

<  e  NhN1 


2 h2  _‘a«l 

1+<W  ^ 


1  +  Cj 


2/?2 

<52c2 


(52) 


The  second  bound  can  be  proven  in  much  the  same  way. 


Discussion 14 Note that the above bound was obtained for a particular choice of λ that
maximizes the coefficient term in the exponent. Other choices of λ may result in a smaller
coefficient in the exponent, but also a decreased leading constant. In particular, setting
$\lambda = \frac{2\delta c_e \beta}{h^2}$, for any 0 < β < 1, yields the following bound


fit,  —  ^2  Xi  >  tS 


2=1 


252ce.0, 

<  e  ft2  e  2h'2 


S2c2p2  (26ceP  \ 

T-T{—ffR-cO 


i + cp  ^2 e  ^  ‘ e 

T=1 


36zce/3  + 

<  e  2h‘2 


1  + 


Ce  (1  -  P) 


_  ce(l -P) 

e  2h2 


(53) 


Proof  of  Lemma  5: 

Lemma 4 implies that, with probability approaching 1 exponentially fast, the state-space
samples issued at a level with h + 1 steps-to-go are optimal. That is, their expectation
equals the actual Q-value. Therefore, by Lemma 4, we have

\[
\mathbb{P}\left\{ \mathbb{E}\left[ X_{t,h+1}(s,a) \mid X_{1,h+1}(s,a), \dots, X_{t-1,h+1}(s,a) \right] \ne Q_{h+1}(s,a) \right\} \le c_p e^{-c_e t}, \tag{54}
\]

where $c_p = 2Kh(2 + c_h)$ and $c_e = \frac{c'_h p}{6K}$. It is also easy to see that $0 \le X_{t,h+1} \le h + 1$, and thus
the conditions of Lemma 3 are satisfied. In turn, from Lemma 3 for $\delta = \frac{d}{2}$ and random
variables with support [0, h + 1]


and,  similarly, 


Qh+ 1  (s,  o)  Qh+1  (s,  —  2 


nfe+i  (s,  a)  =  i 


< 


1  +  c 


2(/i  +  1)^ 


(§)'  « 


3(?)  ce 

g  2(h+l)2 


dH 


<  g  16(/i+1)2K 


1152  • 


A-3(/!  +  1)3(2  +  ch) 


d2p2~ 


72 


Qh+1  (s,  o)  Qh+1  (s,  o)  E  2 


«h+i  (s,  a)  =  t 


,2  / 

dJ^h 


<  g  16(Ji+1)2« 


<  g  16(7i  +  1)2A' 


1152 


3456  • 


K3(h  +  1)3(2  +  Cfe) 
d2p2  d2 

K3{h  +  l)3ch 


d2p2 


72 


since  2  +  cj,  <  3ch. 


(55) 


(56) 


Proof  of  Lemma  2,  induction  step: 

Note that the proof of Lemma 5 is basically the proof of the induction step for the key part
of Lemma 2, that is, Eq. 14. The only thing that remains to be finalized is the correctness
of Eqs. 15 and 16 for h + 1, and these can be verified by substitution of $c_h$ and $c'_h$ in Eq. 56
by the respective expressions (for h) from Eqs. 15 and 16. ■

Proof  of  Theorem  1: 

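The proof for our main results follows by using the same techniques as above. Note that,
by the Hoeffding inequality, after n > 0 iterations of BRUE, for each action a ∈ A($s_0$), it
holds that

\[
\mathbb{P}\left\{ n_H(s_0,a) \le \frac{n}{2KH} \right\} \le e^{-\frac{n}{2K^2H}}. \tag{57}
\]

Given that,

\[
\begin{aligned}
\mathbb{P}\left\{ \pi^n(s_0,H) \ne \pi^*(s_0,H) \right\}
 \le \sum_{a \ne \pi^*(s_0,H)} \Big(
   &\mathbb{P}\left\{ \hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2} \right\} \\
 + &\mathbb{P}\left\{ \hat{Q}_H(s_0,\pi^*(s_0,H)) \le Q_H(s_0,\pi^*(s_0,H)) - \tfrac{d}{2} \right\} \Big).
\end{aligned} \tag{58}
\]

For a sub-optimal action a,

\[
\begin{aligned}
&\mathbb{P}\left\{ \hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2} \right\} \\
&\quad = \sum_{t=1}^{n} \mathbb{P}\left\{ \hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2} \;\middle|\; n_H(s_0,a) = t \right\} \mathbb{P}\left\{ n_H(s_0,a) = t \right\} \\
&\quad \le \mathbb{P}\left\{ n_H(s_0,a) \le \tfrac{n}{2KH} \right\}
   + \sum_{t=\frac{n}{2KH}+1}^{n} \mathbb{P}\left\{ \hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2} \;\middle|\; n_H(s_0,a) = t \right\} \mathbb{P}\left\{ n_H(s_0,a) = t \right\} \\
&\quad \le e^{-\frac{n}{2K^2H}} + \sum_{t=\frac{n}{2KH}+1}^{n} c_H e^{-c'_H t}\, \mathbb{P}\left\{ n_H(s_0,a) = t \right\}
   \qquad \text{(Lemma 2)} \\
&\quad \le e^{-\frac{n}{2K^2H}} + c_H e^{-\frac{c'_H n}{2KH}} \\
&\quad \le 2 c_H e^{-\frac{c'_H n}{2KH}}.
\end{aligned} \tag{59}
\]

Using exactly the same line of bounding, we obtain

\[
\mathbb{P}\left\{ \hat{Q}_H(s_0,\pi^*(s_0,H)) \le Q_H(s_0,\pi^*(s_0,H)) - \tfrac{d}{2} \right\} \le 2 c_H e^{-\frac{c'_H n}{2KH}}, \tag{60}
\]

and thus

\[
\mathbb{P}\left\{ \pi^n(s_0,H) \ne \pi^*(s_0,H) \right\} \le 4K c_H e^{-\frac{c'_H n}{2KH}}. \tag{61}
\]

Eqs. 11, 12, and 13 of Theorem 1 are then obtained by substitution of $c_H$ and $c'_H$ in Eq. 61
with the respective expressions from Eqs. 15 and 16. In turn, Eq. 10 of Theorem 1 stems
from Eq. 11, horizon H, and per-step rewards being in [0, 1]. ■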
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Appendix  B.  Proof  of  Theorem  6 

We  first  prove  the  modified  Hoeffding-Azuma  inequality  for  partial  sums. 


Proof  of  Lemma  8  (Modified  Hoeffding-Azuma  inequality  for  partial  sums): 

Let $E_t$ be the event that $\mathbb{E}[X_t \mid X_1, \dots, X_{t-1}] = \mu$, and let

\[
Y_t = X_t - \mu \mid X_1(\omega), \dots, X_{t-1}(\omega).
\]

The random variable $Y_t$ is bounded by $h - \mu \ge Y_t \ge -\mu$, and furthermore, for $\omega \in E_t$,
$\mathbb{E}\,Y_t = 0$. Therefore, using Fact 13, for all $\omega \in E_t$ and λ ∈ ℝ+, it holds that


E 


< 


(62) 


Moreover, 


E 


e^Ei=t- fail  (h ~Xi ) 
=  E  E, 


gA  Efct- fail  (t*-Xi) 


+  E- 


<Et 


gA  fail  (M-W) 


= 


E 


A  5Zi=t- [at~\  (#*  W) 


ATi, . . . , 


+  E. 


■£t 


0AEi=i~ratl  (M-W) 


=  E 


gA  5Zi=t_  fail  .  ]g 


A(M-W) 


<  e  s  E 

Eq.18  A2h2 
<  e" 


A  S;=(_  |-ati  (M  -X'i) 
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by  the  auxiliary  step  in  Eqs.  49-66  below. 

Considering  the  recursion 

/(*)  =  0f(t~  1 )  +  s(*) 

it  is  easy  to  verify  that,  for  all  0  <  c  <  f, 

t. 
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Given  that,  the  bound  (*)  in  Eq.  48  is  obtained  by  setting 
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Now,  by  Markov  inequality,  for  any  A  >  0, 
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Obviously,  at  leaf  nodes  there  is  no  point  in  choosing  a  <  1  since  there  is  no  bias. 
Therefore,  for  ft  =  1  we  can  use  the  same  constants  c\  =  1  and  dx  =  Since  c'h  is 

decreasing  with  h,  we  have  c'h  <  jf H_h  for  all  1  <  h  <  H,  and  thus  Lemma  4  is  valid. 
Lemma 5 relies on the modified Hoeffding-Azuma inequality, which is no longer valid in
the context of BRUE(α). Instead, we apply its modification, Lemma 8 for partial sums, to
prove the induction step


Qh+\  (s,  o)  Qh+ 1  (s,  n)  L  2 


nh+ 1  (s,a)  =  t 


3d?  pci 


<  e  4:8 Kh2 


at 


12K2h{2  +  ch) 
pc'Jl-a) 


(68) 




and,  similarly, 
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The  induction  step  is  satisfied,  e.g.,  with  c'h+1  =  min 
12K2h(2+ch) 

Pc'h(l~a) 

Since $c_h$ is increasing in h and $c'_h$ is decreasing in h, the term $\frac{12K^2h(2+c_h)}{p\,c'_h(1-\alpha)}$ also increases
in h. The larger the constant grows, the more beneficial it might be to increase the exponent
coefficient that multiplies that constant by decreasing α at the expense of decreasing the
exponent coefficient that multiplies 1. Clearly, the tradeoff depends also on t, the number
of samples of action a in node (s, h). Therefore, as h increases, smaller values of α would
be more appealing.
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Abstract 

In deterministic oversubscription planning (OSP), the objective is to achieve an
as valuable as possible subset of goals within a fixed allowance of the total action
cost (Smith, 2004). Although numerous applications in various fields share this objective,
no substantial algorithmic advances have been made in deterministic OSP, in contrast
to the tremendous progress that has been achieved in the area of classical
deterministic planning (Russell & Norvig, 2009). Tracing the key sources of progress
in classical planning, we identify a severe lack of effective domain-independent
approximations for OSP.

With our focus here on optimal planning, two classes of approximation techniques
have been found especially useful in the context of optimal classical planning: those
based on state-space abstractions and those based on logical landmarks for goal
reachability. The question we study here is whether some similar-in-spirit, yet possibly
mathematically different, approximation techniques can be developed for OSP. In the
context of abstractions, we define the notion of additive abstractions for OSP, study
the complexity of deriving effective abstractions from a rich space of hypotheses, and
reveal some substantial, empirically relevant islands of tractability. In the context of
landmarks, we show how standard goal-reachability landmarks of certain classical
planning tasks can be compiled into the OSP task of interest, resulting in an equivalent OSP
task with a lower cost allowance, and thus with a smaller search space. Our empirical
evaluation confirms the effectiveness of the proposed techniques, and opens a wide gate
for further developments in oversubscription planning.

1.  Introduction 

The tools of automated action planning are developed to allow autonomous systems to select
a course of action “to make things done”. Deterministic planning is probably the most basic,
and thus the most fundamental, setting of automated action planning (Russell & Norvig,
2009). At a high level, deterministic planning is a problem of finding trajectories of interest
in large-scale yet concisely represented state-transition systems. Computational approaches
to deterministic planning vary around the way those “trajectories of interest” are defined.

The basic structure of acting in situations with underconstrained or overconstrained
resources is respectively captured by what these days is called “classical” deterministic
planning (Fikes & Nilsson, 1971), and by what Smith (2004) baptized as “oversubscription”
deterministic planning (OSP). In classical planning, the task is to find an as cost-effective
trajectory as possible to a goal-satisfying state. In oversubscription planning, the task is
to find an as goal-effective (or valuable) state as possible via a cost-satisfying trajectory. In
optimal classical planning and in optimal OSP, the tasks are further constrained to finding
only most cost-effective trajectories and most goal-effective states, respectively. Together,
classical planning and OSP constitute the most fundamental variants of deterministic
planning, with many other variants of deterministic planning, such as net-benefit planning and
cost-bounded planning, being defined in terms of mixing and relaxing the two.¹
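
To illustrate the OSP objective on a toy task, the sketch below exhaustively searches for the most valuable state reachable within a cost budget. It is a hedged illustration only, not the solver discussed in this work: the STRIPS-like encoding (states as sets of facts, actions as precondition/add/delete/cost tuples), the rover-style example, and all names are assumptions, and action costs are assumed to be non-negative.

```python
def best_osp_value(init, goal_values, actions, budget):
    """Most valuable state reachable from init with total action cost <= budget.

    States are frozensets of facts, actions are (pre, add, delete, cost)
    tuples, and goal_values maps individual facts to non-negative values.
    """
    def value(state):
        return sum(v for fact, v in goal_values.items() if fact in state)

    best = value(init)
    cheapest = {init: 0}  # cheapest known cost of reaching each state
    frontier = [(init, 0)]
    while frontier:
        state, spent = frontier.pop()
        for pre, add, delete, cost in actions:
            if pre <= state and spent + cost <= budget:
                nxt = (state - delete) | add
                if cheapest.get(nxt, float("inf")) > spent + cost:
                    cheapest[nxt] = spent + cost
                    best = max(best, value(nxt))
                    frontier.append((nxt, spent + cost))
    return best


def move(x, y, cost):
    """Hypothetical action: travel from location x to location y."""
    return (frozenset({f"at_{x}"}), frozenset({f"at_{y}"}),
            frozenset({f"at_{x}"}), cost)


def sample(x):
    """Hypothetical action: collect a sample at location x for cost 1."""
    return (frozenset({f"at_{x}"}), frozenset({f"sample_{x}"}), frozenset(), 1)


ACTIONS = [move("a", "b", 2), move("b", "c", 2), move("a", "c", 3),
           sample("b"), sample("c")]
VALUES = {"sample_b": 5, "sample_c": 8}
```

With a budget of 4 only one of the two samples is affordable and the search prefers the more valuable one; with a budget of 6 both can be collected. A classical planner would instead be handed both samples as hard goals and asked for the cheapest plan that achieves them.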

While  OSP  has  been  advocated  over  the  years  on  par  with  classical  planning,  so  far, 
the  theory  and  practice  of  classical  planning  have  been  studied  and  advanced  much  more 
intensively.  The  remarkable  success  and  continuing  progress  of  heuristic-search  solvers  for 
classical  planning  is  one  notable  example.  Primary  enablers  of  this  success  are  the  advances 
in  domain-independent  approximations,  or  heuristics,  of  the  cost  needed  to  achieve  a  goal 
state  from  a  given  state.  It  is  thus  possible  that  having  a  similarly  rich  palette  of  effective 
heuristic  functions  for  OSP  would  advance  the  state-of-the-art  in  that  problem. 

With our focus here on optimal planning, two classes of approximation techniques have
been found especially useful in the context of optimal classical planning: those based on
state-space abstractions (Edelkamp, 2001; Haslum, Botea, Helmert, Bonet, & Koenig, 2007;
Helmert, Haslum, & Hoffmann, 2007; Katz & Domshlak, 2010a) and those based on logical
landmarks for goal reachability (Karpas & Domshlak, 2009; Helmert & Domshlak, 2009;
Domshlak, Katz, & Lefler, 2012; Bonet & Helmert, 2010a; Pommerening & Helmert, 2013).
Considering OSP as heuristic search, a question is then whether some similar-in-spirit, yet
possibly mathematically different, approximation techniques can be developed for
heuristic-search OSP. This is precisely the question we study here.

• Starting with the most basic question of what state-space abstractions for OSP
actually are, we show that the very notion of abstraction substantially differs between
classical planning and OSP. Hence, first we define (additive) abstractions and
abstraction heuristics for OSP. We then investigate computational complexity of deriving
effective abstraction heuristics in the scope of homomorphic abstraction skeletons,
paired with cost, value, and budget partitions. Along with revealing some significant
islands of tractability, this study exposes an interesting interplay between
knapsack-style problems of combinatorial optimization, continuous convex optimization, and
certain principles borrowed from explicit abstractions for classical planning.

• We introduce and study $\epsilon$-landmarks, the logical properties of OSP plans that achieve
valuable states. We show that $\epsilon$-landmarks correspond to regular goal-reachability
landmarks of certain classical planning tasks that can be straightforwardly derived
from the OSP tasks of interest. We then show how such $\epsilon$-landmarks can be compiled
back into the OSP task of interest, resulting in an equivalent OSP task, but with a
stricter cost satisfaction constraint, and thus with a smaller effective search space.
Finally, we show how such landmark-based task enrichment can be combined in a
mutually stratifying way with the BFBB search used for OSP planning, resulting
in an incremental procedure that interleaves search and landmark discovery. The
entire framework is independent of the OSP planner specifics, and in particular, of
the heuristic functions it employs.

1. The connections and differences between some popular setups of deterministic planning are
discussed in Section 2.

Our empirical evaluation on a large set of OSP tasks confirms the effectiveness of the
proposed techniques. Also, to our knowledge, our implementation constitutes the first
domain-independent solver for optimal OSP, and we hope that more advances in this important
computational problem will follow.

This work is a revision and extension of the formulations and results presented by the
authors at ICAPS-2013 and ECAI-2014 (Mirkis & Domshlak, 2013, 2014). The paper is
structured as follows. In Section 2 we formulate a general model of deterministic planning,
define several variants of deterministic planning in terms of this model, and, in particular,
show that oversubscription planning differs conceptually not only from classical planning,
but also from other popular setups of deterministic planning such as net-benefit planning
and cost-bounded planning. In Section 2 we also specify a simple model representation
language for OSP, and provide the essential background on heuristic search, and,
in particular, on OSP as heuristic search. Sections 3 and 4 are devoted to abstractions
and abstraction approximations for OSP, respectively. Section 5 is devoted to exploiting
reachability landmarks in OSP tasks. In Section 6 we conclude and discuss some promising
directions for future work. The paper ends with two appendices: For the sake of readability,
some of the proofs are delegated to Appendix A, and some details of the empirical results
are delegated to Appendix B.

2.  Background 

As  we  already  mentioned  in  the  introduction,  specific  variants  of  deterministic  planning 
differ  in  the  way  the  interest  and  preference  over  trajectories  are  defined.  For  instance, 
in  “classical  planning”  (Fikes  &  Nilsson,  1971),  a  trajectory  is  of  interest  if  it  connects 
a  designated  initial  state  to  one  of  the  designated  goal  states,  with  the  preference  being 
towards  trajectories  with  lower  total  cost  of  the  transitions  along  them.  Among  other, 
“non-classical”  variants  of  deterministic  planning  are 

•  oversubscription  planning  (Smith,  2004),  the  topic  of  our  interest  here,  as  well  as 

• net-benefit planning (Sanchez & Kambhampati, 2005; Baier, Bacchus, & McIlraith,
2007; Bonet & Geffner, 2008; Benton, Do, & Kambhampati, 2009; Coles & Coles,
2011; Keyder & Geffner, 2009),

• cost-bounded (also known as resource-constrained) planning (Haslum & Geffner, 2001;
Hoffmann, Gomes, Selman, & Kautz, 2007; Gerevini, Saetti, & Serina, 2008; Thayer &
Ruml, 2011; Thayer, Stern, Felner, & Ruml, 2012; Haslum, 2013; Nakhost, Hoffmann,
& Muller, 2012), and

• planning with preferences over temporal properties of the trajectories (Baier et al.,
2007; Baier, Bacchus, & McIlraith, 2009; Gerevini, Haslum, Long, Saetti, & Dimopoulos,
2009; Benton, Coles, & Coles, 2012).

Interestingly, while working on this paper, we have learned that quite a few different
variants of deterministic planning are often collectively referred to as “oversubscription
planning”. As a result, the difference between them in terms of expressiveness is not necessarily
clear, and thus, the positioning of what we do here with respect to what has already been
done in the collective sense of “oversubscription planning” is not always apparent. This is
the issue we are going to address first.

2.1  Models 

Adopting and extending the notation of Geffner and Bonet (2013), many variants of
deterministic planning, including classical planning, as well as many popular non-classical
variants, can be seen as special cases of a state model

$M = \langle S, s_0, u, O, \varphi, c, Q \rangle$  (1)

with:

• a finite and discrete state space $S$,

• an initial state $s_0 \in S$,

• a state value function $u : S \to \mathbb{R}^{0+} \cup \{-\infty\}$,

• operators $O(s) \subseteq O$ applicable in each state $s \in S$,

• a deterministic state transition function $\varphi(s, o)$ such that $s' = \varphi(s, o)$ stands for the
state resulting from applying $o \in O(s)$ in $s$,

• an operator cost function $c : O \to \mathbb{R}^{0+}$, and

• a quality measure $Q : \mathcal{P} \to \mathbb{R} \cup \{-\infty\}$, where $\mathcal{P}$ is the (infinite) set of trajectories from
$s_0$ along operators $O$.

In this model, any trajectory $\pi \in \mathcal{P}$ is a solution, with preference being towards the solutions
of higher quality. In what follows, $s[\![\pi]\!]$ stands for the end-state of a trajectory $\pi$
applied at state $s$, and $c(\pi) = \sum_{o \in \pi} c(o)$ is the additive cost of $\pi$. Likewise, by graphical
skeleton $G_M = \langle S, T_\varphi, O \rangle$ of a model $M$ we refer to the edge-annotated, unweighted digraph
induced by $M$ naturally as follows: The nodes of $G_M$ are the states $S$, the edge labels are
the operators $O$, and $T_\varphi$ contains an edge from $s$ to $s'$ labeled with $o$ iff $o \in O(s)$ and
$s' = \varphi(s, o)$.

First,  consider  a  quality  measure 

$Q^+(\pi) = u(s[\![\pi]\!]) - c(\pi)$.  (2)


This  measure  assumes  that  state  values  and  operator  costs  are  comparable,  and  thus  trades 
between  the  value  of  the  end-state  and  the  cost  of  the  trajectory.  Consider  now  a  fragment 
of  the  state  model  (1),  instances  of  which  all  have  the  quality  measure  Q+,  and  for  each 
instance,  the  value  function 


$u(s) = \begin{cases} \epsilon, & s \in S_{goal} \\ -\infty, & \text{otherwise} \end{cases}$  (3)

partitions the state space into $S_{goal} \subseteq S$, on which $u$ takes a finite value $\epsilon > 0$, and the
rest of the states, on which $u$ takes the value of $-\infty$. Finding an optimal solution for an
instance $M$ of this fragment corresponds to finding a shortest path from $s_0$ to a single node
$s_*$ in an edge-weighted digraph $G$, which is obtained from $G_M$ by (i) annotating the edges
of the latter with costs $c$, and (ii) adding a dummy node $s_*$ and zero-cost edges from all
goal nodes $s \in S_{goal}$ to $s_*$. While specified in a non-canonical way, it is not hard to verify
that this fragment corresponds to the model of classical planning, with $S_{goal}$ being the
so-called goal states.

[Figure 1 graphic not reproduced: four blocks labeled Net Benefit, Oversubscription,
Classical, and Cost-bounded, arranged along an "Action cost" axis with the entries
"preference" and "constraint".]

Figure 1: Schematic classification of four deterministic planning models along the strictness
with which they approach the cost of operator sequences and the value of the
operator sequence end-states. White blocks are for planning models that can be
solved as single-source single-target shortest path problems.

Staying with the quality measure $Q^+$ and removing now the requirement on $u$ to
comply with Eq. 3, we obtain a fragment that generalizes classical planning, and constitutes
the basic model of what is called net-benefit planning (Sanchez & Kambhampati, 2005).
Importantly, as it was noticed by Keyder and Geffner (2009) (in technically different, yet
semantically similar terms), any instance $M$ of this fragment can be reduced to finding
a shortest path from a single node $s_0$ to a single node $s_*$ in an edge-weighted digraph $G$,
obtained from $G_M$ by (i) annotating edges of $G_M$ with costs $c$, (ii) adding a dummy node $s_*$
and edges from all nodes $s \in S$ to $s_*$, and (iii) setting the cost of each such new edge $(s, s_*)$
to $\sum_{s' \in S \setminus \{s\}} u(s')$. In particular, this equivalence between classical and basic net-benefit
planning at the level of the computational model allowed Keyder and Geffner (2009) to show
how certain standard representation formalisms for net-benefit planning can be efficiently
compiled to a standard classical planning formalism.
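
The reduction just described can be illustrated on an explicitly given digraph. The following is a minimal Python sketch, assuming that all state values are finite and non-negative and that the state space is small enough to enumerate; the function names, the dummy node label `"s*"`, and the data layout are illustrative assumptions and do not come from the original work.

```python
import heapq
from itertools import count


def dijkstra(adj, source):
    """Cheapest-path costs from source; adj maps node -> list of (neighbor, cost)."""
    dist = {source: 0.0}
    tie = count()  # tie-breaker so the heap never has to compare nodes
    heap = [(0.0, next(tie), source)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, next(tie), nxt))
    return dist


def net_benefit_via_shortest_path(edges, values, s0):
    """Best value of u(end-state) - c(trajectory), computed as one shortest path.

    edges:  list of (state, successor, operator_cost)
    values: dict state -> finite non-negative value u(state), for every state
    """
    total = sum(values.values())
    adj = {}
    for s, t, cost in edges:
        adj.setdefault(s, []).append((t, cost))
    # Dummy node: the edge (s, s*) costs the sum of the values of all other states.
    for s, u_s in values.items():
        adj.setdefault(s, []).append(("s*", total - u_s))
    # A path to s* through s costs c(path) + total - u(s), so minimizing it
    # is the same as maximizing u(s) - c(path).
    return total - dijkstra(adj, s0)["s*"]
```

Under these assumptions, the shortest-path cost to the dummy node differs from the optimal net benefit only by the constant sum of all state values, which is why a single-target search suffices.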

Consider now an alternative quality measure

$Q_b(\pi) = \begin{cases} u(s[\![\pi]\!]), & c(\pi) \le b \\ -\infty, & \text{otherwise} \end{cases}$,  (4)

where $b \in \mathbb{R}^{0+}$ is a predefined bound on the cost of the trajectories. The fragment of the
basic model, instances of which are characterized by having the quality measure $Q_b$ and
the “$\epsilon$ or $-\infty$” value functions as in Eq. 3, constitutes the model of what is called
cost-bounded planning (Thayer & Ruml, 2011). Here as well, finding an optimal solution for a
problem instance $M$ corresponds to finding a shortest path from $s_0$ to $s_*$ in an edge-weighted
digraph $G$, which is derived from $G_M$ identically to the case of classical planning.2 This,
in particular, explains why it is only natural for heuristic-search methods for cost-bounded
planning to exploit heuristics developed for classical planning (Haslum, 2013).

We now arrive at a fourth fragment of the basic model. Staying with the quality
measure $Q_b$ and removing the requirement on $u$ to comply with Eq. 3, we obtain a fragment
that generalizes cost-bounded planning, and constitutes the model of oversubscription
planning (Smith, 2004). As illustrated by Figure 1, the hard constraint of classical planning
translates to a soft preference in OSP, and the hard constraint of OSP translates to a soft
preference in classical planning. However, in contrast to cost-bounded, net-benefit, and classical
planning, this fragment does not appear to be reducible to the single-source single-target
shortest path problem. In terms of the digraph $G$ obtained from $G_M$ by annotating the
edges with costs $c$, finding an optimal solution to an instance of oversubscription planning
requires (i) finding shortest paths from $s_0$ to all states $s \in S$ with $u(s) > 0$, (ii) filtering
out from these states those that are not reachable from $s_0$ within the cost allowance $b$, and (iii)
selecting from the remaining states a state that maximizes $u$.
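
Steps (i) to (iii) can be sketched directly for a digraph that is small enough to be given explicitly. The Python fragment below is only an illustration under that assumption; the names `cheapest_costs` and `solve_osp_explicit`, and the edge-list layout, are hypothetical and not part of the original work.

```python
import heapq
from itertools import count


def cheapest_costs(edges, source):
    """Single-source cheapest-path costs over a list of (state, successor, cost)."""
    adj = {}
    for s, t, cost in edges:
        adj.setdefault(s, []).append((t, cost))
    dist = {source: 0.0}
    tie = count()  # tie-breaker so the heap never has to compare states
    heap = [(0.0, next(tie), source)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, next(tie), nxt))
    return dist


def solve_osp_explicit(edges, values, s0, budget):
    """Return (best_state, its value, cheapest cost of reaching it)."""
    dist = cheapest_costs(edges, s0)                                 # step (i)
    affordable = [s for s, d in dist.items() if d <= budget]          # step (ii)
    best = max(affordable, key=lambda s: values.get(s, 0.0))          # step (iii)
    return best, values.get(best, 0.0), dist[best]
```

Unlike the classical, net-benefit, and cost-bounded cases, the sketch needs the cheapest costs to many target states at once, which is what makes the enumeration impractical for implicitly represented state spaces and motivates the branch-and-bound search discussed below.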

This  contrast  between  oversubscription  planning  and  all  the  three  other  popular  variants 
of  deterministic  planning  discussed  above  has  at  least  two  important  implications.  First, 
while  a  single  shortest  path  can  be  searched  for  using  best-first  forward  search  procedures 
such  as  A*,  searching  for  shortest  paths  to  numerous  targets  simultaneously  requires  a 
different,  more  exhaustive,  forward  search  framework  such  as  branch-and-bound.  Second, 
net-benefit  and  cost-bounded  planning  clearly  have  the  potential  to  (directly  or  indirectly) 
reuse  the  rich  toolbox  of  heuristic  functions  that  have  been  developed  over  the  years  for 
classical  planning.  In  contrast,  due  to  the  differences  in  the  underlying  computational 
model,  the  same  is  not  necessarily  true  for  oversubscription  planning,  and  examining  this 
issue  is  precisely  the  focus  of  our  work  here. 

2.2  Notation 

For $k \in \mathbb{N}^+$, by $[k]$ we denote the set $\{1, 2, \dots, k\}$. By $\|x\|$ we refer to the representation
size of object $x$, not to be confused with $|x|$, which denotes the number of elements in set $x$.
An assignment of a variable $v$ to value $d$ is denoted by $\langle v/d \rangle$; we often refer to such single
variable assignments as propositions.

2. To be entirely precise here, once a shortest path $\pi$ from $s_0$ to $s_*$ is found, it should still be checked
against the cost bound $b$. This test, however, is local to $\pi$, and problem solving finishes independently
of the test’s outcome.

2.3 Model Representation

Departing from a very general model of oversubscription planning, in what follows we
restrict our attention to instances of that model that are compactly representable in a language
close to the SAS+ language for classical planning (Backstrom & Klein, 1991; Backstrom &
Nebel, 1995). In this language, a deterministic oversubscription planning (OSP) task
is given by a sextuple

$\Pi = \langle V, s_0, u; O, c, b \rangle$,  (5)

where

(1) $V = \{v_1, \dots, v_n\}$ is a finite set of finite-domain state variables, with each complete
assignment to $V$ representing a state, and $S = dom(v_1) \times \cdots \times dom(v_n)$ being the state
space of the task;

(2) $s_0 \in S$ is a designated initial state;

(3) $u$ is an efficiently computable state value function $u : S \to \mathbb{R}^{0+}$;

(4) $O$ is a finite set of operators, with each operator $o \in O$ being represented by a pair
$\langle pre(o), eff(o) \rangle$ of partial assignments to $V$, called preconditions and effects of $o$,
respectively;

(5) $c : O \to \mathbb{R}^{0+}$ is an operator cost function;

(6) $b \in \mathbb{R}^{0+}$ is a cost budget allowed for the task.

Mapping a task description in this language into our basic model, an OSP task $\Pi =
\langle V, s_0, u; O, c, b \rangle$ induces the model $M_\Pi = \langle S, s_0, u, O, \varphi, c, Q_b \rangle$, with $Q_b$ being the quality
measure (4) instantiated with $\Pi$’s budget $b$, and the transition function $\varphi$ being specified
as follows. Let $V(p) \subseteq V$ denote the subset of variables instantiated by a partial assignment
$p$. Similarly to the classical planning semantics of SAS+, operator $o$ is applicable in a state $s$
iff $s[v] = pre(o)[v]$ for all $v \in V(pre(o))$. Applying $o$ changes the value of each $v \in V(eff(o))$
to $eff(o)[v]$, and the resulting state is denoted by $s[\![o]\!]$. This notation is only defined if $o$
is applicable in $s$. Applying a sequence of operators $\langle o_1, \dots, o_m \rangle$ to a state $s$ is defined
inductively as $s[\![\epsilon]\!] := s$ and $s[\![o_1, \dots, o_i]\!] := s[\![o_1, \dots, o_{i-1}]\!][\![o_i]\!]$. An operator sequence $\pi$ is
called an $s$-plan if it is applicable in state $s$ and $Q_b(\pi) \neq -\infty$, that is, $c(\pi) \le b$.

For an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, by $\mathcal{D} = \bigcup_{v \in V} dom(v)$ we denote the union of
the (uniquely labeled) state-variable domains. For a state $s$ and a proposition $\langle v/d \rangle \in \mathcal{D}$,
$\langle v/d \rangle \in s$ is used as a shortcut notation for $s[v] = d$.
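
The operator semantics above can be made concrete with a short sketch. The Python fragment below is a minimal illustration, assuming that states and partial assignments are stored as dictionaries from variable names to values; the `Operator` class and the helper names are hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    pre: dict    # partial assignment: preconditions
    eff: dict    # partial assignment: effects
    cost: float  # non-negative operator cost


def applicable(state, op):
    """o is applicable in s iff s[v] = pre(o)[v] for every v mentioned in pre(o)."""
    return all(state[v] == d for v, d in op.pre.items())


def apply_operator(state, op):
    """Return the successor state; only defined if op is applicable in state."""
    if not applicable(state, op):
        raise ValueError(f"{op.name} is not applicable")
    successor = dict(state)
    successor.update(op.eff)  # variables in eff(o) are overwritten, the rest persist
    return successor


def run_plan(state, ops, budget):
    """Return (end_state, cost) if ops is an s-plan within budget, else None."""
    total = 0.0
    for op in ops:
        if not applicable(state, op):
            return None
        state = apply_operator(state, op)
        total += op.cost
    return (state, total) if total <= budget else None
```

For instance, with a `move` operator of cost 1 and a `pick` operator of cost 2, the sequence `[move, pick]` is an s-plan under budget 3 but not under budget 2.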

2.4  OSP  as  Heuristic  Search 

The two major ingredients of any heuristic-search planner are its search algorithm and
heuristic function. In classical planning, the heuristic is typically a function $h : S \to \mathbb{R}^{0+} \cup
\{\infty\}$, with $h(s)$ estimating the cost $h^*(s)$ of optimal $s$-plans. A heuristic $h$ is admissible
if it is lower-bounding, that is, $h(s) \le h^*(s)$ for all states $s$. All common heuristic search
algorithms for optimal classical planning, such as A*, require admissible heuristics.

In contrast, a heuristic in OSP is a function $h : S \times \mathbb{R}^{0+} \to \mathbb{R}^{0+}$, with $h(s, b)$ estimating
the value $h^*(s, b)$ of optimal $s$-plans under cost budget $b$. A heuristic $h$ is admissible if
it is upper-bounding, that is, $h(s, b) \ge h^*(s, b)$ for all states $s$ and all cost budgets $b$. Here
as well, search algorithms for optimal OSP, such as best-first branch-and-bound (BFBB),3
require admissible heuristics, which allow search branches to be pruned without violating
solution optimality.

    BFBB($\Pi = \langle V, s_0, u; O, c, b \rangle$)
      open := new max-heap ordered by $f(n) = h(s[n], b - g(n))$
      initialize best solution $n^*$ := make-root-node($s_0$)
      open.insert($n^*$)
      closed := $\emptyset$
      best-cost := 0
      while not open.empty()
        $n$ := open.pop-max()
        if $f(n) \le u(s[n^*])$: break
        if $s[n] \notin$ closed or $g(n) <$ best-cost($s[n]$):
          closed := closed $\cup \{s[n]\}$
          best-cost($s[n]$) := $g(n)$
          foreach $o \in O(s[n])$:
            $n'$ := make-node($s[n][\![o]\!]$)
            if $g(n') > b$ or $f(n') \le u(s[n^*])$: continue
            if $u(s[n']) > u(s[n^*])$: update $n^* := n'$
            open.insert($n'$)
      return $n^*$

Figure 2: Best-first branch-and-bound (BFBB) search for OSP

Figure 2 depicts a pseudo-code description of BFBB for OSP; $s[n]$ there denotes the state
associated with search node $n$. Unlike in A*, the order in which the nodes are selected from
the OPEN list does not affect the optimality guarantees (though, of course, it may seriously
affect the empirical efficiency of the search). In Figure 2, the ordering of OPEN corresponds
to the decreasing order of $h(s[n], b - g(n))$. The duplicate detection and reopening
mechanisms in BFBB are similar to those in A* (Pearl, 1984). In addition, BFBB maintains the
best solution $n^*$ found so far and uses it to prune all generated nodes evaluated no higher
than $u(s[n^*])$. Likewise, complying with the semantics of OSP, all generated nodes $n$ with
cost-so-far $g(n)$ higher than the problem’s budget $b$ are also immediately pruned. When the
OPEN list becomes empty or the node $n$ selected from the list promises less than the lower
bound, BFBB returns (the plan associated with) the best solution $n^*$. If $h$ is admissible,
that is, the $h$-based pruning of the generated nodes is sound, then the returned plan is
guaranteed to be optimal.
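
The search loop described above can be sketched in runnable form. The following Python fragment is a minimal illustration and not the authors' implementation: it assumes hashable states, a caller-supplied `successors` function yielding `(operator, successor, cost)` triples, and an admissible (upper-bounding) `heuristic(state, remaining_budget)`; all of these names are assumptions made for the example.

```python
import heapq
from itertools import count


def bfbb(s0, successors, value, heuristic, budget):
    """Best-first branch-and-bound for OSP; returns (best_state, plan)."""
    tie = count()                   # tie-breaker so the heap never compares states
    best_state, best_plan = s0, []  # incumbent solution n*
    best_cost = {}                  # cheapest g with which each state was expanded
    # heapq is a min-heap, so f-values are negated to pop the largest f first.
    open_heap = [(-heuristic(s0, budget), next(tie), 0.0, s0, [])]
    while open_heap:
        neg_f, _, g, s, plan = heapq.heappop(open_heap)
        if -neg_f <= value(best_state):
            break                   # nothing left in OPEN can beat the incumbent
        if s in best_cost and g >= best_cost[s]:
            continue                # duplicate reached at no lower cost
        best_cost[s] = g            # first expansion, or reopening at a lower cost
        for op, s2, cost in successors(s):
            g2 = g + cost
            if g2 > budget:
                continue            # violates the cost budget
            f2 = heuristic(s2, budget - g2)
            if f2 <= value(best_state):
                continue            # sound pruning, given an admissible heuristic
            if value(s2) > value(best_state):
                best_state, best_plan = s2, plan + [op]
            heapq.heappush(open_heap, (-f2, next(tie), g2, s2, plan + [op]))
    return best_state, best_plan
```

Storing the whole plan in each heap entry keeps the sketch short; an implementation meant for larger tasks would more likely keep parent pointers in the search nodes instead.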

Returning now to the heuristic functions, in domain-independent planning they should
be automatically derived from the description of the model in the language of choice. A
useful heuristic function must be both efficiently computable from the description of the
model, as well as relatively accurate in its estimates. Improving the accuracy of a heuristic
function without substantially worsening the time complexity of computing it translates
into faster search for plans.

3. BFBB is also extensively used for net-benefit planning (Benton, van den Briel, & Kambhampati, 2007;
Coles & Coles, 2011; Do, Benton, van den Briel, & Kambhampati, 2007), as well as some other variants
of deterministic planning (Bonet & Geffner, 2008; Brafman & Chernyavsky, 2005).

In classical planning, numerous approximation techniques, such as monotonic
relaxation (Bonet & Geffner, 2001, 2001; Hoffmann & Nebel, 2001), critical trees (Haslum &
Geffner, 2000), network flow (van den Briel, Benton, Kambhampati, & Vossen, 2007; Bonet,
2013), logical landmarks for goal reachability (Richter, Helmert, & Westphal, 2008; Karpas
& Domshlak, 2009; Helmert & Domshlak, 2009; Bonet & Helmert, 2010a), and
abstractions (Edelkamp, 2001; Helmert et al., 2007; Katz & Domshlak, 2010a), have been
translated to effective heuristic functions. Likewise, different heuristics for classical planning can
also be combined into their point-wise maximizing and/or additive ensembles (Edelkamp,
2001; Haslum, Bonet, & Geffner, 2005; Coles, Fox, Long, & Smith, 2008; Katz & Domshlak,
2010b; Helmert & Domshlak, 2009).

In contrast, development of heuristic functions for OSP has not progressed beyond the
initial ideas of Smith (2004). In principle, the reduction of Keyder and Geffner (2009) from
net-benefit to classical planning can be used to reduce OSP to classical planning with
real-valued state variables (Koehler, 1998; Helmert, 2002; Fox & Long, 2003; Hoffmann, 2003;
Gerevini, Saetti, & Serina, 2003; Gerevini et al., 2008; Edelkamp, 2003; Dvorak & Bartak,
2010; Coles, Coles, Fox, & Long, 2013). So far, however, progress in heuristic-search classical
planning with numeric state variables has mostly been achieved around direct extensions
of delete relaxation heuristics via “numeric relaxed planning graphs” (Hoffmann, 2003;
Edelkamp, 2003; Gerevini et al., 2003, 2008). Unfortunately, these heuristics do not preserve
information on consumable resources such as budgeted operator cost in oversubscription
planning: The “negative” action effects that decrease the values of numeric variables are
ignored, possibly up to some special handling of so-called “cyclic resource transfer” (Coles
et al., 2013).

Approaching the deficit in effective heuristics for OSP, in the next section we study
abstractions for OSP, from their very definition and properties, to the prospects of deriving
admissible abstraction heuristics. In Section 5 we then study the prospects of adapting to
OSP the toolbox of logical landmarks for goal reachability. To date, abstractions and
landmarks are responsible for most state-of-the-art admissible heuristics for classical planning,
and thus are of our special interest here.

3.  Abstractions 

The term “abstraction” is usually associated with simplifying the original model, factoring
out details less crucial in the given context. Which details can be reduced and which
should better be preserved, as well as how the abstraction is created and used, depends
largely on the context (Cousot & Cousot, 1992; Clarke, Grumberg, & Peled, 1999; Helmert
et al., 2007; Domshlak, Hoffmann, & Sabharwal, 2009; Katz & Domshlak, 2010b). In
general terms, abstracting a model $M$ corresponds to associating it with a set of (typically
computationally more attractive) models $M_1, \dots, M_k$ such that solutions to these models
satisfy certain properties with respect to the solutions of $M$. In particular, in deterministic
planning as heuristic search, abstractions are used to derive heuristic estimates for the states
of the model of interest $M$: Given a state $s$ of $M$ and an abstraction $M_1, \dots, M_k$,

(1) $s$ is mapped to some “abstract states” $s_1 \in M_1, \dots, s_k \in M_k$,

(2) the $k$ models of the abstraction are solved for the respective initial states $s_1, \dots, s_k$,
and

(3) an aggregation of the quality of the resulting $k$ solutions is used as the heuristic estimate
for $s$.

Sometimes schematically and sometimes precisely, the process of constructing
abstractions as above for a state model $M = \langle S, s_0, u, O, \varphi, c, Q \rangle$ can be seen as a two-step process
of

(1) Selecting an abstraction skeleton $\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$, where each pair
$(G_i, \alpha_i)$ comprises an edge-labeled digraph $G_i = \langle S_i, T_i, O_i \rangle$, with nodes $S_i$, edges $T_i$,
and edge labels $O_i$, and a state mapping $\alpha_i : S \to S_i$.

(2) Extending $\mathcal{AS}$ to a set of abstract models $\mathcal{M} = \{M_1, \dots, M_k\}$, such that, for $i \in [k]$,
$G_i$ is the graphical skeleton $G_{M_i}$ of $M_i$.

To be qualified as a valid abstraction of the model $M$, the resulting set of abstract models
$\mathcal{M}$ should satisfy certain conditions specific to the variant of the deterministic planning
under consideration. For instance, the optimal solutions of abstract models in classical
planning are required to be at most as costly as the respective solutions in the original
models, with that constraint to be satisfied by individual abstract models in case of
max-aggregation (Pearl, 1984), or by the $k$ abstract models jointly, in case of additive
abstractions (Yang, Culberson, Holte, Zahavi, & Felner, 2008; Katz & Domshlak, 2010b). As we
now show, the concept of abstractions in general, and additive abstractions in particular,
is very different in OSP, and, for better and for worse, has many more degrees of freedom
than the respective concepts in classical planning.

3.1  Abstractions  of  OSP  Problems 

Given an abstraction skeleton $\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ for an OSP state model $M =
\langle S, s_0, u, O, \varphi, c, Q_b \rangle$, each digraph $G_i = \langle S_i, T_i, O_i \rangle$ implicitly defines a set of OSP state
models consistent with it. This set is given by $C_i \times U_i \times B_i$ where $C_i$ is the set of all
functions from operators $O_i$ to $\mathbb{R}^{0+}$, $U_i$ is the set of all functions from states $S_i$ to $\mathbb{R}^{0+}$,
and $B_i = \mathbb{R}^{0+}$. In these terms, each point $(c, u, b) \in C_i \times U_i \times B_i$ induces an OSP model
consistent with $G_i$, and vice versa.

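Connecting between these sets of models for all the digraphs in $\mathcal{AS}$, let

$\mathbf{C} = C_1 \times \cdots \times C_k$,

$\mathbf{U} = U_1 \times \cdots \times U_k$,

$\mathbf{B} = B_1 \times \cdots \times B_k$.

For each state $s \in M$, every point $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathbf{C} \times \mathbf{U} \times \mathbf{B}$ induces a set of models

$\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})} = \{M_1^{(\mathbf{c},\mathbf{u},\mathbf{b})}, \dots, M_k^{(\mathbf{c},\mathbf{u},\mathbf{b})}\}$,

with $M_i^{(\mathbf{c},\mathbf{u},\mathbf{b})} = \langle S_i, \alpha_i(s_0), \mathbf{u}[i], O_i, \varphi_i, \mathbf{c}[i], Q_{\mathbf{b}[i]} \rangle$:

- the states $S_i$ and operators $O_i$ correspond to the nodes and edge labels of $G_i$;

- the transition function $\varphi_i(s, o) = s'$ iff $T_i$ contains an arc from $s$ to $s'$ labeled
with $o \in O_i$;

- the initial state $\alpha_i(s_0)$ is determined by the initial state $s_0$ and the state mapping $\alpha_i$;
and

- the operator cost function, state value function, and cost budget are all directly
determined by the choice of $(\mathbf{c}, \mathbf{u}, \mathbf{b})$.

For some choices $(\mathbf{c}, \mathbf{u}, \mathbf{b})$ from $\mathbf{C} \times \mathbf{U} \times \mathbf{B}$, the induced sets of models $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}$ can be
used for deriving admissible estimates for the state of interest $s_0$, while others cannot. The
respective qualification is defined below.

[Figure 3 graphic not reproduced: panel (a) shows the graphical skeleton $G_M$ and panel (b)
the abstract digraphs $G_1$ and $G_2$ of the running example.]

Figure 3: Illustration for our running example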

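Definition 1 (Additive OSP Abstraction)

Let $M = \langle S, s_0, u, O, \varphi, c, Q_b \rangle$ be an OSP model, and $\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ be
an abstraction skeleton for $M$. For $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathbf{C} \times \mathbf{U} \times \mathbf{B}$, $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}$ is an (additive)
abstraction for $M$, denoted as

$\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})} \in_{\mathcal{AS}} M$,

if and only if

$h^*(s_0, b) \le h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b) = \sum_{i \in [k]} h_i^*(\alpha_i(s_0), \mathbf{b}[i])$,

that is, when $h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b)$ is an admissible estimate of $h^*(s_0, b)$.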

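In simple terms, a set of models forms an additive OSP abstraction if jointly it does
not underestimate the value that can be obtained from the initial state, within the given
cost budget. For example, let $G_M$ in Figure 3a be the graphical skeleton of a state model
$M = \langle \{s_0, \dots, s_4\}, s_0, u, \{o_1, \dots, o_5\}, \varphi, c, Q_b \rangle$, with $c(o_i) = 1$ for all operators $o_i$, $b = 2$,
and $u(s_i) = \mathbb{1}_{\{i=4\}}$. Let $\mathcal{AS} = \{(G_1, \alpha_1), (G_2, \alpha_2)\}$ be an abstraction skeleton for $M$, with
$G_1$ and $G_2$ as in Figure 3b and with state mappings

$\alpha_1(s_i) = \begin{cases} s_4^1, & i \in \{1, 3\} \\ s_i^1, & \text{otherwise} \end{cases}$,

$\alpha_2(s_i) = \begin{cases} s_4^2, & i = 2 \\ s_i^2, & \text{otherwise} \end{cases}$.

Consider a set of models $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}$ with constant $\mathbf{c}[1](\cdot) = \mathbf{c}[2](\cdot) = 1$, $\mathbf{b}[1] = \mathbf{b}[2] = 2$, and,
for $i \in [2]$, $\mathbf{u}[i](s_j^i) = \mathbb{1}_{\{j=4\}}$. The optimal $s_0$-plan for $M$ is $\pi = \langle (s_0, o_2, s_2), (s_2, o_4, s_4) \rangle$,
with $Q_b(\pi) = 1$, while the optimal $\alpha_1(s_0)$-plan for $M_1^{(\mathbf{c},\mathbf{u},\mathbf{b})}$ is $\pi_1 = \langle (s_0^1, o_1, s_4^1) \rangle$, with
$Q_{\mathbf{b}[1]}(\pi_1) = 1$, and the optimal $\alpha_2(s_0)$-plan for $M_2^{(\mathbf{c},\mathbf{u},\mathbf{b})}$ is $\pi_2 = \langle (s_0^2, o_2, s_4^2) \rangle$, with
$Q_{\mathbf{b}[2]}(\pi_2) = 1$. Since

$h^*(s_0, b) = Q_b(\pi) \le Q_{\mathbf{b}[1]}(\pi_1) + Q_{\mathbf{b}[2]}(\pi_2) = h_1^*(\alpha_1(s_0), \mathbf{b}[1]) + h_2^*(\alpha_2(s_0), \mathbf{b}[2])$,

$\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}$ is an additive abstraction for $M$.

[Figure 4 graphic not reproduced: a lattice with $(-, -, -)$ at the top; $(\mathbf{c}, -, -)$, $(-, \mathbf{u}, -)$, and
$(-, -, \mathbf{b})$ below it; then $(\mathbf{c}, \mathbf{u}, -)$, $(-, \mathbf{u}, \mathbf{b})$, and $(\mathbf{c}, -, \mathbf{b})$; and $(\mathbf{c}, \mathbf{u}, \mathbf{b})$ at the bottom.]

Figure 4: Fragments of restricted optimization over the abstractions $\mathcal{A} \subseteq \mathbf{C} \times \mathbf{U} \times \mathbf{B}$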

Theorem 1 For any OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, any abstraction skeleton $\mathcal{AS} =
\{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ of $M_\Pi$, and any $\mathcal{M} \in_{\mathcal{AS}} M_\Pi$, if the digraphs of $\mathcal{AS}$ are given
explicitly, then $h_{\mathcal{M}}(s_0, b)$ can be computed in time polynomial in $\|\Pi\|$ and $\|\mathcal{M}\|$.

Proof: The proof is straightforward. Let $\mathcal{M} = \{M_i\}_{i \in [k]}$, with $M_i = \langle S_i, \alpha_i(s_0), u_i, O_i, \varphi_i, c_i, Q_{b_i} \rangle$,
be an additive abstraction for $M_\Pi$ on the basis of $\mathcal{AS}$. For $i \in [k]$, let $S_i' = \{s \in S_i \mid
c_i(\alpha_i(s_0), s) \le b_i\}$. Since the digraphs of $\mathcal{AS}$ are given explicitly, computing shortest
paths from $\alpha_i(s_0)$ to all states in $G_i$, and thus computing $S_i'$, can be done in time
polynomial in $\|\mathcal{M}\|$ for all $i \in [k]$. In turn, since $h_i^*(\alpha_i(s_0), b_i) = \max_{s \in S_i'} u_i(s)$, computing
$h_{\mathcal{M}}(s_0, b) = \sum_{i \in [k]} h_i^*(\alpha_i(s_0), b_i)$ is polynomial time in $\|\mathcal{M}\|$. □
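
The computation used in the proof can be sketched as follows. This Python fragment is an illustration only: it assumes that each abstract model is given explicitly as a dictionary with an edge list, an abstract initial state, a value table, and a budget, and the two tiny models in the usage note are hypothetical stand-ins rather than the digraphs of Figure 3.

```python
import heapq
from itertools import count


def cheapest_costs(edges, source):
    """Single-source cheapest-path costs over a list of (state, successor, cost)."""
    adj = {}
    for s, t, cost in edges:
        adj.setdefault(s, []).append((t, cost))
    dist = {source: 0.0}
    tie = count()  # tie-breaker so the heap never has to compare states
    heap = [(0.0, next(tie), source)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, next(tie), nxt))
    return dist


def additive_abstraction_estimate(models):
    """Sum over abstract models of the best state value reachable within budget."""
    total = 0.0
    for m in models:
        dist = cheapest_costs(m["edges"], m["init"])
        # S'_i: abstract states whose cheapest cost from the initial state fits b_i.
        affordable = [s for s, d in dist.items() if d <= m["budget"]]
        total += max(m["values"].get(s, 0.0) for s in affordable)
    return total
```

With two hypothetical one-edge models that each reach a state of value 1 at cost 1, budgets of 2 in both models give an estimate of 2, while setting one budget to 0 lowers the estimate to 1. Whether such an estimate is admissible still depends on the choice of costs, values, and budgets, which is the question taken up next.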

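The message of Theorem 1 is positive, yet it establishes only a necessary condition for
the relevance of OSP abstractions to practice. Given an OSP task $\Pi$, and having fixed an
abstraction skeleton $\mathcal{AS}$ with a joint performance measure space $\mathbf{C} \times \mathbf{U} \times \mathbf{B}$, we should
be able to automatically separate between those $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathbf{C} \times \mathbf{U} \times \mathbf{B}$ that constitute
abstractions for $M_\Pi$ and those that do not, and within the former set, denoted as

$\mathcal{A} \subseteq \mathbf{C} \times \mathbf{U} \times \mathbf{B}$,

home in on an abstraction that provides us with as accurate (aka as low) an estimate of
$h^*(s_0, b)$ as possible. Here, even the first item on the agenda is not necessarily trivial as, in
general, $\mathcal{A}$ seems to lack convenient combinatorial properties. For instance, generally $\mathcal{A}$ does
not form a combinatorial rectangle in $\mathbf{C} \times \mathbf{U} \times \mathbf{B}$: Consider the OSP state model $G_M$ and
abstraction skeleton $\mathcal{AS}$ from our running example. Let $\mathbf{c} \in \mathbf{C}$ be a cost function vector with
both $\mathbf{c}[1]$ and $\mathbf{c}[2]$ being constant functions with value of 1, and two performance measures
$(\mathbf{c}, \mathbf{u}, \mathbf{b}), (\mathbf{c}, \mathbf{u}', \mathbf{b}') \in \mathbf{C} \times \mathbf{U} \times \mathbf{B}$ being defined via budget vectors $\mathbf{b} = \{\mathbf{b}[1] = 2, \mathbf{b}[2] = 0\}$
and $\mathbf{b}' = \{\mathbf{b}'[1] = 0, \mathbf{b}'[2] = 2\}$, and value function vectors $\mathbf{u}$ and $\mathbf{u}'$, with $\mathbf{u}[1]$, $\mathbf{u}[2]$, $\mathbf{u}'[1]$,
and $\mathbf{u}'[2]$ evaluating to zero on all states except for $\mathbf{u}[1](s_4^1) = \mathbf{u}'[2](s_4^2) = 1$. It is not hard
to verify that $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}, \mathcal{M}_{(\mathbf{c},\mathbf{u}',\mathbf{b}')} \in_{\mathcal{AS}} M$, yet $\mathcal{M}_{(\mathbf{c},\mathbf{u}',\mathbf{b})}, \mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b}')} \notin_{\mathcal{AS}} M$.

[Figure 5 graphic not reproduced.]

Figure 5: Homomorphic abstraction skeleton for $G_M$ in Figure 3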

Taking that on board, we break down and approach the overall agenda of complexity analysis of abstraction-based heuristic functions under fixation of some of the three dimensions of $\mathcal{A}$: If, for instance, we are given a vector of value functions $\mathbf{u}$ that is known to belong to the projection of $\mathcal{A}$ on $\mathcal{U}$, then we can search for a quality abstraction from the abstraction subset $\mathcal{A}(-, \mathbf{u}, -) \subseteq \mathcal{A}$, corresponding to the projection of $\mathcal{A}$ on $\{\mathbf{u}\}$. As we show below, even some constrained optimizations of this kind can be challenging. The lattice in Figure 4 depicts the range of options for such constrained optimization; at the extreme settings, $\mathcal{A}(-, -, -)$ is simply a renaming of $\mathcal{A}$, and $\mathcal{A}(\mathbf{c}, \mathbf{u}, \mathbf{b})$ corresponds to a single abstraction $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})} \in \mathcal{A}$.

3.2  Partitions  and  Homomorphic  Abstractions 

We now proceed with considering a specific family of additive abstractions, reveal some of its interesting properties, and show that it contains substantial islands of tractability. With Definition 1 allowing for very general abstraction skeletons, in this work we focus on$^4$ homomorphic abstraction skeletons (Helmert et al., 2007).

Definition 2  An abstraction skeleton $\mathcal{AS} = \{(G_1, \alpha_1), \ldots, (G_k, \alpha_k)\}$ for an OSP state model $M = \langle S, s_0, u, O, \varphi, c, Q_b \rangle$ is homomorphic if, for $i \in [k]$, $O_i = O$, and $\varphi(s, o) = s'$ only if $(\alpha_i(s), o, \alpha_i(s')) \in T_i$.

For instance, in our running example, the abstraction skeleton in Figure 3b is not homomorphic, while the abstraction skeleton in Figure 5 is homomorphic. Furthermore, we focus on a fragment of additive abstractions

$\mathcal{A}_p = \mathcal{A} \cap [\mathcal{C}_p \times \mathcal{U}_p \times \mathcal{B}_p]$,

where $\mathcal{C}_p \subseteq \mathcal{C}$, $\mathcal{U}_p \subseteq \mathcal{U}$, and $\mathcal{B}_p \subseteq \mathcal{B}$ correspond to cost, value, and budget partitions, respectively.

Definition 3  Given an OSP state model $M = \langle S, s_0, u, O, \varphi, c, Q_b \rangle$, and a homomorphic abstraction skeleton $\mathcal{AS} = \{(G_1, \alpha_1), \ldots, (G_k, \alpha_k)\}$ for $M$ with a joint performance measure space $\mathcal{C} \times \mathcal{U} \times \mathcal{B}$,

•  $\mathbf{c} \in \mathcal{C}$ is a cost partition iff, for each operator $o \in O$, $\sum_{i \in [k]} \mathbf{c}[i](o) \le c(o)$;

•  $\mathbf{u} \in \mathcal{U}$ is a value partition iff, for each state $s \in S$, $\sum_{i \in [k]} \mathbf{u}[i](\alpha_i(s)) \ge u(s)$; and

•  $\mathbf{b} \in \mathcal{B}$ is a budget partition iff $\sum_{i \in [k]} \mathbf{b}[i] \le b$.

4.  All the results also hold verbatim for the more general “labeled paths preserving” abstraction skeletons studied by Katz and Domshlak (2010b) in the context of optimal classical planning. However, the presentation is somewhat more accessible when restricted to homomorphic abstraction skeletons.

Figure 6: Illustration for sub-claims (1) and (2) of Theorem 2: In (1), the gray ellipse within $\mathcal{B}_p$ stands for the subset of budget partitions $\mathbf{b}$ that pair with $\mathbf{c}$ in some abstraction, that is, $\mathcal{A}_p(\mathbf{c}, -, \mathbf{b}) \neq \emptyset$. However, while pairing some of these budget partitions $\mathbf{b}$ with $\mathbf{c}$ requires then a careful selection of a value partition $\mathbf{u}$ (so that $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}$ will be an abstraction), there exists some budget partition $\mathbf{b}^*$ for which any choice of $\mathbf{u}$ will do the job.
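The three partition conditions of Definition 3 are simple to check mechanically. The following Python sketch is illustrative only; the dictionary-based encoding of costs, values, and abstraction mappings, and all function names, are assumptions made for the example and do not come from the paper.

```python
def is_cost_partition(c_parts, c, operators):
    """For each operator, the abstract costs must sum to at most the original cost."""
    return all(sum(ci[o] for ci in c_parts) <= c[o] for o in operators)

def is_value_partition(u_parts, alphas, u, states):
    """For each state, the abstract values of its images must sum to at least its value."""
    return all(
        sum(ui.get(alpha[s], 0) for ui, alpha in zip(u_parts, alphas)) >= u[s]
        for s in states
    )

def is_budget_partition(b_parts, b):
    """The abstract budgets must sum to at most the original budget."""
    return sum(b_parts) <= b

# Toy instance with k = 2 abstractions (hypothetical data).
operators = ["o1", "o2"]
c = {"o1": 2, "o2": 3}
c_parts = [{"o1": 1, "o2": 3}, {"o1": 1, "o2": 0}]

states = ["s0", "s1"]
u = {"s0": 0, "s1": 2}
alphas = [{"s0": "a0", "s1": "a1"}, {"s0": "b0", "s1": "b0"}]
u_parts = [{"a1": 2}, {}]
```

With this encoding, a triple is a candidate element of the partition space exactly when all three checks succeed; whether it actually induces an abstraction is a separate question, as the text goes on to discuss.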

In what follows, for any node $x$ of the lattice in Figure 4, by $\mathcal{A}_p(x)$ we refer to $\mathcal{A}(x) \cap \mathcal{A}_p$; e.g., $\mathcal{A}_p(-, \mathbf{u}, -) = \mathcal{A}(-, \mathbf{u}, -) \cap \mathcal{A}_p$.

We  begin  our  analysis  of  Ap  by  establishing  an  interesting  “completeness”  relationship 
between  the  sets  Cp  and  Bp,  as  well  as  an  even  stronger  individual  “completeness”  of  Cp  and 
Bp.  Formulated  in  Theorem  2,  these  properties  of  Ap  play  a  key  role  in  our  computational 
analysis  later  on. 

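Theorem 2  Given an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$ and a homomorphic abstraction skeleton $\mathcal{AS} = \{(G_1, \alpha_1), \ldots, (G_k, \alpha_k)\}$ of $M_\Pi$,

(1)  for each cost partition $\mathbf{c} \in \mathcal{C}_p$, there exists a budget partition $\mathbf{b}^* \in \mathcal{B}_p$ such that $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)}$ is an additive abstraction of $M_\Pi$ on the basis of $\mathcal{AS}$ for all value partitions $\mathbf{u} \in \mathcal{U}_p$.

(2)  for each budget partition $\mathbf{b} \in \mathcal{B}_p$, there exists a cost partition $\mathbf{c}^* \in \mathcal{C}_p$ such that $\mathcal{M}_{(\mathbf{c}^*,\mathbf{u},\mathbf{b})}$ is an additive abstraction of $M_\Pi$ on the basis of $\mathcal{AS}$ for all value partitions $\mathbf{u} \in \mathcal{U}_p$.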

The proof of Theorem 2 appears in Appendix A, p. 44. Figure 6 illustrates the statement of sub-claim (1) of Theorem 2, as well as, indirectly, some of its corollaries.$^5$ The first corollary of Theorem 2 is that the projections of $\mathcal{A}_p$ on $\mathcal{C}_p$, $\mathcal{U}_p$, and $\mathcal{B}_p$ are the entire sets $\mathcal{C}_p$, $\mathcal{U}_p$, and $\mathcal{B}_p$, respectively. That is, any cost partition $\mathbf{c}$ (and similarly, any budget partition and any value partition) can be matched with an abstraction that has that partition as its component. Second, while not every budget partition $\mathbf{b}$ can be paired with a given cost partition $\mathbf{c}$ in abstractions for $M_\Pi$, that is, $\mathcal{A}_p(\mathbf{c}, -, \mathbf{b}) \neq \emptyset$ does not hold for all $\mathbf{b} \in \mathcal{B}_p$, there are always some budget partitions that can be paired with $\mathbf{c}$. Finally, while pairing some of these “$\mathbf{c}$-compatible” budget partitions $\mathbf{b}$ with $\mathbf{c}$ requires then a careful selection of a value partition $\mathbf{u}$, there exists some “$\mathbf{c}$-compatible” budget partition $\mathbf{b}^*$ for which any choice of $\mathbf{u}$ will result in $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)}$ being an abstraction of $M_\Pi$.

5.  The respective illustration of sub-claim (2) of Theorem 2 is completely similar, mutatis mutandis.

A priori, these properties of $\mathcal{A}_p$ should simplify the task of abstraction discovery and optimization within the space of partitions $\mathcal{C}_p \times \mathcal{U}_p \times \mathcal{B}_p$, and later we show that this is indeed the case. However, complexity analysis of abstraction discovery within $\mathcal{C}_p \times \mathcal{U}_p \times \mathcal{B}_p$ in most general terms is still problematic because the OSP formalism is parametric in the representation of value functions. Hence, here we proceed with examining abstraction discovery for OSP in the context of fixed value partitions $\mathbf{u} \in \mathcal{U}_p$.

4.  From  Value  Partitions  to  Complete  Abstractions 

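Let $\Pi$ be an OSP task, $\mathcal{AS}$ be an explicitly given homomorphic abstraction skeleton of $M_\Pi$, and $\mathbf{u} \in \mathcal{U}_p$ be a value partition over $\mathcal{AS}$. An immediate corollary of Theorem 2 is that $\mathcal{A}_p(-, \mathbf{u}, -)$ is not empty, and thus we can try computing $\min_{(\mathbf{c},\mathbf{u},\mathbf{b}) \in \mathcal{A}_p(-,\mathbf{u},-)} h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b)$.

As of yet, however, we do not know whether this task is polynomial-time solvable for any non-trivial class of value partitions. In fact, even though, by Theorem 2, $\mathcal{A}_p(-, \mathbf{u}, -)$ is known to be non-empty, and so, too, are all of its subsets $\mathcal{A}_p(-, \mathbf{u}, \mathbf{b})$ and $\mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$, finding even just any abstraction $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{A}_p(-, \mathbf{u}, -)$ is not necessarily easy.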

4.1  0-Binary  Value  Partitions 

As  a  first  step,  we  now  examine  abstraction  discovery  within  a  fragment  of  Ap  in  which  all 
value  functions  u[i]  of  the  abstract  models  are  what  we  call  0-binary.  Later,  in  Section  4.2, 
we  show  how  our  findings  for  0-binary  abstract  value  functions  can  be  extended  to  general 
value  partitions. 

Definition 4  A real-valued function $f$ is called 0-binary if it takes values in $\{0, \sigma\}$ for some $\sigma \in \mathbb{R}^+$.

A set $F$ of 0-binary functions is called strong if all the functions in $F$ take values in $\{0, \sigma\}$ for the same $\sigma \in \mathbb{R}^+$.

On the one hand, 0-binary functions constitute rather a basic family of value functions. Hence, if abstraction optimization is hard for them, it is likely to be hard for any non-trivial family of abstract value functions. On the other hand, 0-binary abstract value functions seem to fit well abstractions of planning tasks in which value functions are linear combinations of indicators, each representing achievement of a “goal value” for some state variable.

In that respect, our first tractability results are for abstraction discovery within $\mathcal{A}_p(-, \mathbf{u}, -)$ where $\mathbf{u}$ is a strong 0-binary value partition. The first (and the simpler) result in Theorem 3 further assumes a fixed action cost partition, while the next result, in Theorem 7, is on simultaneous selection of admissible pairs of cost and budget partitions. In Corollary 4 and Theorem 10 we then show how the results of Theorem 3 and Theorem 7, respectively, can be extended to pseudo-polynomial algorithms for general 0-binary value partitions.


4.1.1  Strong  0-Binary  Value  Partitions  and  the  Knapsack  problem 

Our first tractability result is for abstraction discovery within $\mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$ where $\mathbf{u}$ is a strong 0-binary value partition and $\mathbf{c}$ is an arbitrary cost partition. The key role here is played by the well-known Knapsack problem (Dantzig, 1930; Kellerer, Pferschy, & Pisinger, 2004). An instance $(\{w_i, \sigma_i\}_{i \in [n]}, W)$ of the Knapsack problem is given by a weight allowance $W$ and a set of objects $[n]$, with each object $i \in [n]$ being annotated with a weight $w_i$ and a value $\sigma_i$. The objective is to find a subset $Z \subseteq [n]$ that maximizes $\sum_{i \in Z} \sigma_i$ over subsets $Z' \subseteq [n]$ with $\sum_{i \in Z'} w_i \le W$. By strict Knapsack we refer to a variant of Knapsack in which that inequality constraint is strict. Knapsack is NP-hard (Karp, 1972; Garey & Johnson, 1978), but there exist pseudo-polynomial algorithms for it that run in time polynomial in the description of the problem and in the unary representation of $W$ (Dudzinski & Walukiewicz, 1987). The latter property makes solving Knapsack practical in many applications where the ratio $\frac{W}{\min_i w_i}$ is reasonably low. Likewise, if $\sigma_i = \sigma_j$ for all $i, j \in [n]$, then a greedy algorithm solves the problem in linear time by iteratively expanding $Z$ by one of the weight-wise lightest objects in $[n] \setminus Z$, until $Z$ cannot be expanded any further within $W$.
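As a small illustration of the identical-value special case, the greedy procedure can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, and for brevity it sorts the objects first, which costs O(n log n) rather than the linear time mentioned above.

```python
def uniform_value_knapsack(weights, capacity):
    """Greedy Knapsack for identical item values: take the lightest items first.

    Returns the indices of the selected items, in the order they were added.
    """
    chosen = []
    remaining = capacity
    for i in sorted(range(len(weights)), key=lambda j: weights[j]):
        if weights[i] > remaining:
            break  # every remaining item is at least as heavy
        chosen.append(i)
        remaining -= weights[i]
    return chosen

selection = uniform_value_knapsack([4, 1, 3, 2], capacity=6)
```

Because all values coincide, maximizing total value is the same as maximizing the number of selected items, which is why preferring the lightest items is optimal here.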


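Theorem 3 ($\mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$ & strong 0-binary $\mathbf{u}$)

Let $\Pi = \langle V, s_0, u; O, c, b \rangle$ be an OSP task, $\mathcal{AS}$ be an explicit homomorphic abstraction skeleton of $M_\Pi$, and $\mathbf{u} \in \mathcal{U}_p$ be a strong 0-binary value partition. Given a cost partition $\mathbf{c} \in \mathcal{C}_p$, finding an abstraction $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$ and computing the corresponding heuristic estimate $h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b)$ can be done in time polynomial in $\|\Pi\|$ and $\|\mathcal{AS}\|$.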

Proof: The proof is by reduction to the polynomial fragment of the Knapsack problem corresponding to all items having identical value. Let $\mathcal{AS} = \{(G_1, \alpha_1), \ldots, (G_k, \alpha_k)\}$, and, given that $\mathbf{u}$ is a strong 0-binary value partition, let all $\mathbf{u}[i]$ take values in $\{0, \sigma\}$ for some $\sigma \in \mathbb{R}^+$.

For $i \in [k]$, let $w_i$ be the cost of the cheapest path in $G_i$ from $\alpha_i(s_0)$ to (one of the) states $s \in S_i$ with $\mathbf{u}[i](s) = \sigma$. Since $\mathcal{AS}$ is an explicit abstraction skeleton, the set $\{w_i\}_{i \in [k]}$ can be computed in time polynomial in $\|\mathcal{AS}\|$ using one of the standard algorithms for the single-source shortest paths problem. Consider now a Knapsack problem $(\{w_i, \sigma\}_{i \in [k]}, b)$, with weights $w_i$ being as above and value $\sigma$ being identical for all objects. Let $Z \subseteq [k]$ be a solution to that (optimization) Knapsack problem; recall that it is computable in polynomial time. Given that, we define budget profile $\mathbf{b}^* \in \mathcal{B}$ as follows:

$$\text{for } i \in [k], \quad \mathbf{b}^*[i] = \begin{cases} w_i, & i \in Z \\ 0, & \text{otherwise.} \end{cases}$$

What remains to be shown is that $(\mathbf{c}, \mathbf{u}, \mathbf{b}^*)$ actually induces an additive abstraction for $M_\Pi$. Assume to the contrary that $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)}$ is not an additive abstraction of $M_\Pi$ on the basis of $\mathcal{AS}$, and let $\pi$ be an optimal $s_0$-plan for $\Pi$. By the construction of our Knapsack problem and of $\mathbf{b}^*$, for each $i \in Z$, there is an $\alpha_i(s_0)$-plan $\pi_i$ for $M_i^{(\mathbf{c},\mathbf{u},\mathbf{b}^*)}$ with $Q_{\mathbf{b}^*[i]}(\pi_i) = \sigma$. By Definition 1, our assumption implies that $Q_b(\pi) > \sum_{i \in Z} Q_{\mathbf{b}^*[i]}(\pi_i) = \sigma \cdot |Z|$. However, by Theorem 2, there exists at least one budget partition $\mathbf{b} \in \mathcal{B}_p$ such that $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}$ is an additive abstraction of $M_\Pi$ on the basis of $\mathcal{AS}$. Note that this budget partition induces a feasible solution $Z' = \{i \mid w_i \le \mathbf{b}[i]\}$ for our Knapsack problem, satisfying $Q_b(\pi) \le \sum_{i \in Z'} Q_{\mathbf{b}[i]}(\pi_i) = \sigma \cdot |Z'|$. This, however, implies $|Z| < |Z'|$, contradicting the optimality of $Z$, and thus accomplishing the proof that $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)}$ is an additive abstraction of $M_\Pi$ on the basis of $\mathcal{AS}$.  □
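The construction used in the proof (shortest-path weights, a uniform-value Knapsack, and the resulting budget profile) can be sketched as follows. This is an illustrative sketch under assumptions: the dictionary keys `edges`, `init`, and `goals` (the abstract states valued $\sigma$) and the function names are hypothetical, and edge weights are assumed to already reflect the given cost partition.

```python
import heapq

def cheapest_goal_cost(edges, init, goals):
    """Cost of the cheapest path from init to any state in goals (inf if none)."""
    dist = {init: 0.0}
    frontier = [(0.0, init)]
    while frontier:
        d, s = heapq.heappop(frontier)
        if d > dist[s]:
            continue  # stale queue entry
        if s in goals:
            return d
        for t, w in edges.get(s, []):
            if d + w < dist.get(t, float("inf")):
                dist[t] = d + w
                heapq.heappush(frontier, (d + w, t))
    return float("inf")

def budget_profile(models, budget, sigma):
    """Return (b_star, estimate): budgets w_i for the greedily chosen models, 0 elsewhere."""
    weights = [cheapest_goal_cost(m["edges"], m["init"], m["goals"]) for m in models]
    b_star = [0.0] * len(models)
    chosen = []
    remaining = budget
    # Identical values: the lightest-first greedy choice solves this Knapsack.
    for i in sorted(range(len(models)), key=lambda j: weights[j]):
        if weights[i] > remaining:
            break
        chosen.append(i)
        b_star[i] = weights[i]
        remaining -= weights[i]
    return b_star, sigma * len(chosen)

# Three toy abstract models (hypothetical data); goal costs are 3, 2, and 4.
models = [
    {"edges": {"a": [("g", 3.0)]}, "init": "a", "goals": {"g"}},
    {"edges": {"x": [("y", 1.0)], "y": [("z", 1.0)]}, "init": "x", "goals": {"z"}},
    {"edges": {"p": [("q", 4.0)]}, "init": "p", "goals": {"q"}},
]
b_star, estimate = budget_profile(models, budget=5.0, sigma=1.0)
```

With a budget of 5, the two cheapest goals (costs 2 and 3) fit, so the sketch assigns them exactly those budgets and reports an estimate of $2\sigma$.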

The construction in the proof of Theorem 3 may appear somewhat counterintuitive: while we are interested in minimizing the heuristic estimate of $h^*(s_0, b)$, the abstraction $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)}$ is selected via the value-maximizing Knapsack problem. Indeed, while ultimately we would like to obtain

$$\min_{\mathbf{b} \,:\, (\mathbf{c},\mathbf{u},\mathbf{b}) \in \mathcal{A}_p} h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b), \tag{6}$$

the heuristic we manage to compute in polynomial time is actually

$$\max_{\mathbf{b} \,:\, (\mathbf{c},\mathbf{u},\mathbf{b}) \in \mathcal{A}_p} h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b). \tag{7}$$

As can be seen from the proof of Theorem 3, Eq. 7 seems to be the best one can hope to achieve on the basis of the properties of $\mathcal{A}_p$ captured by Theorem 2. However, note that, for a fixed pair of $\mathbf{c} \in \mathcal{C}_p$ and $\mathbf{u} \in \mathcal{U}_p$, the estimate in Eq. 7 is still at least as accurate as (and possibly much more accurate than) the estimate that would be obtained by providing each of the $k$ abstract models with the entire budget $b$. Shortly below we show that this superior accuracy clearly shows up in our experiments, but first we proceed with examining general 0-binary value partitions.

While  strong  0-binary  value  partitions  are  rather  restrictive,  finding  an  element  of 
Ap(c,u,  — )  for  general  0-binary  u  is  no  longer  polynomial — a  reduction  from  Knapsack 
is  straightforward.  However,  Knapsack  is  solvable  in  pseudo-polynomial  time,  and  plugging 
that  Knapsack  algorithm  into  the  proof  of  Theorem  3  results  in  a  search  algorithm  for 
Ap(c,  u,  — )  with  general  0-binary  u. 
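For reference, the textbook pseudo-polynomial dynamic program for Knapsack looks as follows. This is a generic sketch, not the specific algorithm of the cited works; it assumes non-negative integer weights, and its running time is proportional to the number of items times the capacity, that is, polynomial in the unary representation of the capacity.

```python
def knapsack_max_value(weights, values, capacity):
    """0/1 Knapsack by dynamic programming over capacities 0..capacity.

    best[cap] holds the maximum total value achievable with total weight <= cap.
    """
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacities downwards so that each item is used at most once.
        for cap in range(capacity, w - 1, -1):
            best[cap] = max(best[cap], best[cap - w] + v)
    return best[capacity]

result = knapsack_max_value([1, 3, 4, 5], [1, 4, 5, 7], capacity=7)
```

In the setting of the text, the weights would be the cheapest goal-reaching costs in the abstract models and the values the respective $\sigma_i$, with the task budget as capacity.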

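Corollary 4 ($\mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$ & 0-binary $\mathbf{u}$)

Let $\Pi = \langle V, s_0, u; O, c, b \rangle$ be an OSP task, $\mathcal{AS}$ be an explicit homomorphic abstraction skeleton of $M_\Pi$, and $\mathbf{u} \in \mathcal{U}_p$ be a 0-binary value partition. Given a cost partition $\mathbf{c} \in \mathcal{C}_p$, finding an abstraction $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$ and computing the corresponding heuristic estimate $h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b)$ can be done in time polynomial in $\|\Pi\|$, $\|\mathcal{AS}\|$, and the unary representation of the budget $b$ of $\Pi$.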

To test and illustrate the value that additive abstractions can bring to heuristic-search OSP, we have implemented a prototype heuristic-search OSP solver on top of the Fast Downward planner (Helmert, 2006).$^6$ Since, unlike classical and net-benefit planning, OSP still lacks a standard suite of benchmarks for comparative evaluation, we have cast in this role the STRIPS classical planning domains from the International Planning Competitions (IPC) 1998-2006. This “translation” to OSP was done by associating a separate unit-value with each sub-goal.

Within our prototype, we have implemented the BFBB search for OSP, and provided support for some basic pattern-database abstraction skeletons, action cost partitions, and abstraction selection in $\mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$ for strong 0-binary value partitions as in the proof of Theorem 3. Specifically, for a task with $k$ sub-goals,

(i)  the abstraction skeleton comprised a set of some $k$ projections of the planning task onto connected subsets of ancestors of the respective $k$ goal variables in the causal graph;

(ii)  the value partition $\mathbf{u}$ associated the value of each sub-goal (only) with the respective projection; and

(iii)  the action cost partition $\mathbf{c}$ was chosen ad hoc.

The size of each projection was limited to 1000 abstract states.

6.  We are not aware of any other domain-independent planner for optimal OSP.

…                 50   50   50   50   50   50   50   50   50   50   45   45
mystery (4)        4    4    4    4    4    4    4    4    4    3    3    2
openstacks (7)     7    7    7    7    7    7    7    7    7    7    7    7
rovers (10)       10   10   10   10    7    7    7    6    6    6    5    5
satellite (9)      9    8    8    7    6    6    6    4    5    5    4    4
tpp (7)            7    7    7    7    7    7    6    6    6    6    5    5
trucks (9)         9    9    9    9    8    8    6    5    5    5    5    5
pipesw-t (12)     12   12   12   12   12   12   12   11   11   11   10   10
pipesw-nt (7)      7    7    7    7    7    7    7    7    7    7    6    6
psr-small (30)    30   30   30   30   30   30   30   30   30   30   30   30
zenotravel (10)   10   10   10   10    9    8    9    8    8    8    7    7
total            239  238  238  234  228  226  222  211  209  209  195  191

Table 1: Number of problems solved across the different budgets using the OPEN list ordered by the heuristic evaluation as in Figure 2

In our evaluation, we compared BFBB node expansions with three heuristic functions, tagged blind, basic, and $h_{\mathcal{M}}$. With all three heuristics, the $h$-value of a node $n$ is set to 0 if the cost budget at $n$ is over-consumed. Otherwise,

•  blind BFBB constitutes a trivial baseline in which $h(n)$ is simply set to the total value of all goals.

•  In basic BFBB, $h(n)$ is set to the total value of goals, each of which can be individually achieved within the respective projection abstraction (see Theorem 1) given the entire remaining budget.

•  $h_{\mathcal{M}}$ is an additive abstraction heuristic that is selected from $\mathcal{A}_p(\mathbf{c}, \mathbf{u}, -)$ as in the proof of Theorem 3.
Figure 7: Comparative view of empirical results from Table 1 in terms of expanded nodes (panels (a) and (b))


The  evaluation  contained  all  the  planning  tasks  for  which  we  could  determine  offline 
the  minimal  cost  needed  to  achieve  all  the  goals.  Each  such  task  was  approached  under 
four  different  budgets,  corresponding  to  25%,  50%,  75%,  and  100%  of  the  minimal  cost 
needed  to  achieve  all  the  goals  in  the  task,  and  each  run  was  restricted  to  10  minutes. 
Table  1  shows  the  number  of  tasks  solved  within  each  domain  for  each  level  of  cost  budget, 
and  Figure  7  depicts  the  results  in  terms  of  expanded  nodes  across  the  four  levels  of  cost 
budget.  (Figures  15-18  in  Appendix  B  provide  a  more  detailed  view  on  the  results  in 
Figure  7  by  breaking  them  along  different  levels  of  cost  budget.)  Despite  the  simplicity  of 
the abstraction skeletons we used, the number of nodes expanded by BFBB with $h_{\mathcal{M}}$ was typically substantially lower than the number of nodes expanded by basic BFBB, with the difference sometimes reaching three orders of magnitude.

4.1.2  Freeing  Cost  Partition:  Knapsack  Meets  Convex  Optimization 

Returning now to the algorithmic analysis in the context of strong 0-binary value partitions, we proceed with relaxing the constraint of sticking to a fixed action cost partition $\mathbf{c}$. This buys more flexibility in selecting abstractions from $\mathcal{A}_p(-, \mathbf{u}, -)$, allowing us to improve the accuracy of the heuristic estimates while still remaining computationally tractable.

Given an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, a homomorphic abstraction skeleton $\mathcal{AS}$, and a value partition $\mathbf{u} \in \mathcal{U}_p$ over $\mathcal{AS}$, let

$$\kappa(\mathbf{u}) = \min_{\mathbf{c} \in \mathcal{C}_p} \ \max_{\mathbf{b} \,:\, (\mathbf{c},\mathbf{u},\mathbf{b}) \in \mathcal{A}_p} h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b). \tag{8}$$

Obviously, the estimate $h(s_0, b) = \kappa(\mathbf{u})$ is at least as accurate as the estimate in Eq. 7 that is derived with respect to a fixed cost partition $\mathbf{c}$.

We now show that, for any OSP task $\Pi$, any abstraction skeleton $\mathcal{AS} = \{(G_1, \alpha_1), \ldots, (G_k, \alpha_k)\}$ of $M_\Pi$ and any strong 0-binary value partition $\mathbf{u} \in \mathcal{U}_p$ over $\mathcal{AS}$, computing $\kappa(\mathbf{u})$ is polynomial time. The corresponding algorithm is shown in Figure 8, with Figure 8a depicting the macro-flow of the algorithm and Figure 8b depicting the specific implementation of the solve sub-routine that makes the overall time complexity of the algorithm polynomial.

The high-level flow of the algorithm in Figure 8a is as follows. Since $\mathbf{u}$ is a strong 0-binary value partition, let all abstract value functions $\mathbf{u}[i]$ take values in $\{0, \sigma\}$ for some $\sigma \in \mathbb{R}^+$. Given that, for each $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{A}_p(-, \mathbf{u}, -)$, it holds that $h_{\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s_0, b) = m\sigma$ for some $m \in \{0\} \cup [k]$. The first, preprocessing for-loop of the algorithm eliminates from the abstraction skeleton all the nodes that are structurally unreachable from the abstract initial states $\alpha_1(s_0), \ldots, \alpha_k(s_0)$.$^7$ For ease of presentation, in what follows we assume that this clean up of the abstraction skeleton leaves each $G_i$ with at least one state whose value is $\sigma$. The second, main for-loop of the algorithm decreasingly iterates over all the values $\{k\sigma, (k-1)\sigma, \ldots, 2\sigma, \sigma\}$ that can possibly come from the abstractions in $\mathcal{A}_p(-, \mathbf{u}, -)$ as an estimate of $h^*(s_0, b)$. Each of these candidates for $\kappa(\mathbf{u})$ is tested in turn via the sub-routine always-achievable. If and when this test comes positive for the first time, then we are done, and the tested candidate $m\sigma$ is identified as $\kappa(\mathbf{u})$. Otherwise, if the test fails for all $m \in [k]$, then $\kappa(\mathbf{u}) = 0$, in particular implying that no state with value greater than 0 can be reached from $s_0$ under budget $b$.
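The macro-flow just described amounts to a downward scan over the candidate counts $m$. A minimal Python sketch follows; the name `always_achievable` stands for a caller-supplied test (in the paper, the LP-based sub-routine) and, like the function name `kappa_strong`, is an assumption made for illustration.

```python
def kappa_strong(k, sigma, always_achievable):
    """Scan m = k, k-1, ..., 1 and return m * sigma for the first m that passes.

    always_achievable is a callable taking m and returning a bool; it stands in
    for the LP-based test and is supplied by the caller.
    """
    for m in range(k, 0, -1):
        if always_achievable(m):
            return m * sigma
    return 0

# Toy oracle (hypothetical): at most two goal values are always achievable.
value = kappa_strong(k=4, sigma=3, always_achievable=lambda m: m <= 2)
```

The sketch makes at most $k$ oracle calls, so its overall cost is dominated by the cost of a single test.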

The test of always-achievable for $\kappa(\mathbf{u}) = m\sigma$ is based on a linear program (LP) $\mathcal{L}_1(m)$, given by Eq. 10. This linear program is defined over variables

$$X = \{\xi\} \cup \bigcup_{i \in [k]} \Big[ \{d(s)\}_{s \in G_i} \cup \{\mathbf{b}[i]\} \cup \bigcup_{o \in O} \{\mathbf{c}[i](o)\} \Big], \tag{9}$$

constraints (10a)-(10c), and the objective of maximizing the value of the variable $\xi$.


7.  This preprocessing can be replaced by adding some extra constraints in the linear program described below. However, that would unnecessarily complicate the presentation without adding much value.
input: $\Pi = \langle V, s_0, u; O, c, b \rangle$, $\mathcal{AS} = \{(G_1, \alpha_1), \ldots, (G_k, \alpha_k)\}$ of $M_\Pi$,
       strong 0-binary value partition $\mathbf{u} \in \mathcal{U}_p$
output: $\kappa(\mathbf{u})$

for $i = 1$ to $k$ do
    reduce $G_i$ to only nodes reachable from $\alpha_i(s_0)$
for $m = k$ downto 1 do
    if always-achievable($m$) then return $m\sigma$
return 0

always-achievable($m$):
    solve($\mathcal{L}_1(m)$) $\mapsto$ solution $x \in dom(X)$
    if $x[\xi] \le b$ then return true
    else return false

(a)

solve($\mathcal{L}_1(m)$):
    set (10c') to an arbitrary subset of constraints (10c)
    loop
        set $\mathcal{L}'_1(m)$ to $\mathcal{L}_1(m)$, with constraints (10c') instead of (10c)
        ellipsoid-method($\mathcal{L}'_1(m)$) $\mapsto$ solution $x \in dom(X)$
        let $\tau$ be a permutation of $[k]$ such that
            $x[\mathbf{b}[\tau(1)]] \le x[\mathbf{b}[\tau(2)]] \le \cdots \le x[\mathbf{b}[\tau(k)]]$
        if $x[\xi] \le \sum_{i \in [m]} x[\mathbf{b}[\tau(i)]]$ then return $x$
        extend (10c') with constraint $\xi \le \sum_{i \in [m]} \mathbf{b}[\tau(i)]$

(b)

Figure 8: A polynomial-time algorithm for computing $\kappa(\mathbf{u})$ for a strong 0-binary value partition $\mathbf{u} \in \mathcal{U}_p$ (Theorem 7).


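$$
\begin{aligned}
\mathcal{L}_1(m):\quad & \max\ \xi \\
\text{subject to}\quad
& \forall i \in [k]:\ 
\begin{cases}
d(\alpha_i(s_0)) = 0, \\
d(s) \le d(s') + \mathbf{c}[i](o), & \forall (s', o, s) \in G_i, \\
\mathbf{b}[i] \le d(s), & \forall s \in G_i \text{ s.t. } \mathbf{u}[i](s) = \sigma,
\end{cases} && \text{(10a)} \\
& \forall o \in O:\ 
\begin{cases}
\mathbf{c}[i](o) \ge 0, & \forall i \in [k], \\
\sum_{i \in [k]} \mathbf{c}[i](o) \le c(o),
\end{cases} && \text{(10b)} \\
& \forall Z \subseteq [k],\ |Z| = m:\quad \xi \le \sum_{i \in Z} \mathbf{b}[i]. && \text{(10c)}
\end{aligned}
$$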
The roles of the different variables in $\mathcal{L}_1(m)$ are as follows.

•  Variable $\mathbf{c}[i](o)$ captures the cost to be associated with label $o$ in the digraph $G_i$ of $\mathcal{AS}$.

•  For a state $s$ in $G_i$, variable $d(s)$ captures the cost of the cheapest path in $G_i$ from $\alpha_i(s_0)$ to $s$, given that the edges of $G_i$ are weighted consistently with the values of the variables $\mathbf{c}[i](\cdot)$.

•  Variable $\mathbf{b}[i]$ captures the minimal budget needed for reaching in $G_i$ a state with value $\sigma$ from state $\alpha_i(s_0)$, given that, again, the edges of $G_i$ are weighted consistently with the variable vector $\mathbf{c}[i]$.

•  The singleton variable $\xi$ captures the minimal total cost of reaching states with value $\sigma$ in precisely $m$ out of $k$ models in $\mathcal{M}_{(\mathbf{c},\mathbf{u},\mathbf{b})}$.

The semantics of the constraints in $\mathcal{L}_1(m)$ are as follows.

•  The first two sets of constraints in (10a) come from a simple LP formulation of the single source shortest paths problem with the source node $\alpha_i(s_0)$: Optimizing $\sum_{i \in [k]} \sum_{s \in G_i} d(s)$ under a fixed weighting $\mathbf{c}$ of the edges leads to computing precisely that, for all $k$ digraphs in $\mathcal{AS}$ simultaneously.

•  The third set of constraints in (10a) establishes the costs of the cheapest paths in $\{G_i\}$ from states $\alpha_i(s_0)$ to states valued $\sigma$, enforcing the semantics of variables $\mathbf{b}[1], \ldots, \mathbf{b}[k]$.

•  Constraints (10b) are the cost partition constraints that enforce $\mathbf{c} \in \mathcal{C}_p$.

•  Constraints (10c) enforce the aforementioned semantics of the objective variable $\xi$.

Two things are worth noting here. First, if all the nodes in the digraphs $G_1, \ldots, G_k$ are structurally reachable from the “source nodes” $\alpha_1(s_0), \ldots, \alpha_k(s_0)$, respectively (as is ensured by the first for-loop of the algorithm), then the polytope induced by $\mathcal{L}_1(m)$ is bounded and non-empty. Indeed, for any assignment to $\bigcup_{o \in O} \{\mathbf{c}[i](o)\}$ that is consistent with the positiveness constraints in (10b), all the variables $d(\cdot)$ are bounded from above by the lengths of the respective shortest paths. In turn, this bounding of $d(\cdot)$ bounds from above the variables $\mathbf{b}[1], \ldots, \mathbf{b}[k]$ via the third set of constraints in (10a), and the constraints (10c) then bound from above the objective $\xi$.

Second, while the number of variables, as well as the number of constraints in (10a) and (10b), are polynomial in $\|\Pi\|$ and $\|\mathcal{AS}\|$, the number of constraints in (10c) is $\binom{k}{m}$. Thus, solving $\mathcal{L}_1(m)$ using standard methods for linear programming is not practical. In Lemma 5 below we show that this issue can actually be mitigated, and then, in Lemma 6, we show that the semantics of $\mathcal{L}_1(m)$ match our objective of finding $\kappa(\mathbf{u})$.

Lemma 5  The algorithm in Figure 8 terminates in time polynomial in $\|\Pi\|$ and $\|\mathcal{AS}\|$.

Proof: The runtime complexity of the algorithm boils down to the complexity of solving $\mathcal{L}_1(m)$ and, while the number of variables in $\mathcal{L}_1(m)$, as well as the number of constraints in (10a) and (10b), are polynomial in $\|\Pi\|$ and $\|\mathcal{AS}\|$, the number of constraints in (10c) is $\binom{k}{m}$. Thus, solving $\mathcal{L}_1(m)$ using standard methods for linear programming is not practical. However, using the ellipsoid algorithm for linear inequalities (Grötschel, Lovász, & Schrijver, 1981), an LP with an exponential number of constraints can be solved in polynomial time provided that an associated “separation problem” can be solved in polynomial time. In our case, the separation problem is, given an assignment to the variables of $\mathcal{L}_1(m)$, to test whether it satisfies (10a), (10b), and (10c), and if not, to produce an inequality among (10a), (10b), and (10c) violated by that assignment.

We now show how our separation problem for $\mathcal{L}_1(m)$ can be solved in polynomial time using what is called $m$-sum minimization LPs (Punnen, 1992), and this is precisely what the procedure solve($\mathcal{L}_1(m)$) in Figure 8b does. As the number of constraints in (10a) and (10b) is polynomial, their satisfaction by an assignment $x \in dom(X)$ can be tested directly by substitution. For constraints (10c), let $\tau$ be a permutation of $[k]$ such that $x[\mathbf{b}[\tau(1)]] \le x[\mathbf{b}[\tau(2)]] \le \cdots \le x[\mathbf{b}[\tau(k)]]$. If $x[\xi] \le \sum_{i \in [m]} x[\mathbf{b}[\tau(i)]]$, then it is easy to see that $x$ satisfies all the constraints in (10c). Otherwise, we have our violated inequality $\xi \le \sum_{i \in [m]} \mathbf{b}[\tau(i)]$.

□
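The separation step for constraints (10c) reduces to comparing $\xi$ with the sum of the $m$ smallest budget values. A minimal Python sketch, with a hypothetical function name and a plain list standing in for the assignment to $\mathbf{b}[1], \ldots, \mathbf{b}[k]$:

```python
def violated_m_sum_constraint(xi, budgets, m):
    """Separation oracle for the family xi <= sum of b[i] over all subsets Z of size m.

    Returns None if every such constraint holds; otherwise returns the index set Z
    of the m smallest budgets, whose constraint is the most violated one.
    """
    order = sorted(range(len(budgets)), key=lambda i: budgets[i])
    smallest = order[:m]
    if xi <= sum(budgets[i] for i in smallest):
        return None  # the tightest constraint holds, hence all of them hold
    return smallest

ok = violated_m_sum_constraint(3, [5, 1, 3, 2], m=2)
bad = violated_m_sum_constraint(4, [5, 1, 3, 2], m=2)
```

Checking only the $m$ smallest values suffices because any other subset of size $m$ has a sum at least as large, so its constraint is implied.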


Lemma 6  The algorithm in Figure 8a computes $\kappa(\mathbf{u})$.

The proof of Lemma 6 appears in Appendix A, p. 45. Putting Lemmas 5 and 6 together, Theorem 7 summarizes our tractability result for abstraction discovery in $\mathcal{A}_p(-, \mathbf{u}, -)$ for strong 0-binary value partitions $\mathbf{u}$.

Theorem 7 ($\mathcal{A}_p(-, \mathbf{u}, -)$ & strong 0-binary $\mathbf{u}$)

Given an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, a homomorphic explicit abstraction skeleton $\mathcal{AS}$ of $M_\Pi$, and a strong 0-binary value partition $\mathbf{u} \in \mathcal{U}_p$, computing $\kappa(\mathbf{u})$ can be done in time polynomial in $\|\Pi\|$ and $\|\mathcal{AS}\|$.

4.1.3  From  Strong  to  General  0-Binary  Value  Partitions 

Recall that the polynomial result of Theorem 3 for strong 0-binary value partitions easily extends in Corollary 4 to a pseudo-polynomial algorithm for general 0-binary value partitions. It turns out that a pseudo-polynomial extension of Theorem 7 is possible as well, though it is technically more involved. The corresponding algorithm is shown in Figure 9. Following the format of Figure 8, Figure 9a depicts the macro-flow of the algorithm and Figure 9b shows the specific implementation of the solve sub-routine that allows achieving the desired time complexity.

Similarly to the algorithm in Figure 8, first, a preprocessing for-loop of the algorithm
eliminates from the abstraction skeleton all the nodes that are structurally unreachable
from the abstract initial states $\alpha_1(s_0), \dots, \alpha_k(s_0)$. Next, the algorithm performs a binary
search over an interval containing $\kappa(u)$.$^8$ Since $u$ is a 0-binary value partition, for $i \in [k]$,
by $\{0, \sigma_i\}$, $\sigma_i \in \mathbb{R}^+$, we denote the range of the abstract value function $u[i]$. Given that,
for each $(c, u, b) \in A_p(-, u, -)$, it holds that $h_{(c,u,b)}(s) = \sum_{i \in Z} \sigma_i$ for some $Z \subseteq [k]$. As
the size of this combinatorial hypothesis space is prohibitive, the while-loop in Figure 9

8.  While  a  binary  search  could  have  been  used  in  the  algorithm  in  Figure  8  as  well,  there  it  would  be  a 
mere  optimization,  while  here  it  is  necessary  to  avoid  an  exponential  blowup  of  the  time  complexity. 
input: $\Pi = \langle V, s_0, u; O, c, b \rangle$, $\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ of $M_\Pi$,
       0-binary value partition $u \in U_p$
output: $\kappa(u)$

for $i = 1$ to $k$ do
    reduce $G_i$ to only nodes reachable from $\alpha_i(s_0)$
let $0 < \epsilon < \min_{i \in [k]} \sigma_i$
$\alpha \leftarrow 0$
$\beta \leftarrow \sum_{i \in [k]} \sigma_i$
while $\beta - \alpha > \epsilon$ do
    $v \leftarrow \alpha + (\beta - \alpha)/2$
    if always-achievable($v$) then $\alpha \leftarrow v$
    else $\beta \leftarrow v$
if $\alpha = 0$ then return 0
else return $\beta$

always-achievable($v$):
    solve($\mathcal{L}_2(v)$) $\mapsto$ solution $x \in dom(X)$
    if $x[\xi] \le b$ then return true
    else return false

(a)

solve($\mathcal{L}_2(v)$):
    set (11c') to an arbitrary subset of constraints (11c)
    loop
        set $\mathcal{L}'_2(v)$ to $\mathcal{L}_2(v)$, with constraints (11c') instead of (11c)
        ellipsoid-method($\mathcal{L}'_2(v)$) $\mapsto$ solution $x \in dom(X)$
        strict-Knapsack($\langle \{x[b[i]], \sigma_i\}_{i \in [k]}, x[\xi] \rangle$) $\mapsto$ solution $Z \subseteq [k]$
        if $\sum_{i \in Z} \sigma_i < v$ then return $x$
        extend (11c') with constraint $\xi \le \sum_{i \in Z} b[i]$

(b)

Figure 9: A pseudo-polynomial algorithm for approximating $\kappa(u)$ for general 0-binary value
partitions $u \in U_p$ (Theorem 10).


performs a binary search over a relaxed hypothesis space, corresponding to the continuous
interval $[0, \sum_{i \in [k]} \sigma_i]$ of $\mathbb{R}^{0+}$. The parameter $\epsilon$ serves as the “sufficient precision” criterion
for termination.

At the iteration corresponding to an interval $[\alpha, \beta]$, the algorithm uses its sub-routine
always-achievable to test the hypothesis $\kappa(u) \ge v$, where $v$ is the mid-point of $[\alpha, \beta]$. If
the test comes positive, then the next tested hypothesis is $\kappa(u) \ge v'$, where $v'$ is the mid-point
of $[v, \beta]$. Otherwise, the next hypothesis corresponds to the mid-point of $[\alpha, v)$. When
the while-loop is done, the reported estimate is set to $\beta$; while there still might be some
lag between $\beta$ and $\kappa(u)$, this lag can be arbitrarily reduced by reducing $\epsilon$, and anyway,
$\beta \ge \kappa(u)$ ensures admissibility of the estimate. If, however, the while-loop terminates with
$\alpha = 0$, then $\kappa(u) \le \beta \le \epsilon < \min_{i \in [k]} \sigma_i$ implies $\kappa(u) = 0$, and this is what we return.
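
The macro-flow of Figure 9a can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `approximate_kappa` and the oracle parameter `always_achievable` are hypothetical, and the oracle is a stand-in for the LP-based test described in the text rather than an implementation of it.

```python
def approximate_kappa(sigmas, eps, always_achievable):
    """Bisection over [0, sum(sigmas)], mirroring the macro-flow of Figure 9a.

    `always_achievable(v)` is a caller-supplied oracle (an assumption of this
    sketch); it should return True iff the hypothesis kappa(u) >= v holds.
    """
    if not 0 < eps < min(sigmas):
        raise ValueError("eps must satisfy 0 < eps < min(sigmas)")
    alpha, beta = 0.0, float(sum(sigmas))
    while beta - alpha > eps:
        v = alpha + (beta - alpha) / 2  # mid-point of the current interval
        if always_achievable(v):
            alpha = v   # hypothesis holds: continue in the upper half
        else:
            beta = v    # hypothesis fails: continue in the lower half
    # alpha == 0 means that no tested hypothesis ever held, so 0 is returned
    return 0.0 if alpha == 0 else beta


# Toy oracle standing in for the LP-based test; the "true" value here is 5.
estimate = approximate_kappa([2, 3, 4], 0.5, lambda v: v <= 5)
```

With the toy oracle, the returned estimate stays at or above the true value and overshoots it by at most `eps`, which is the admissibility argument made above.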

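The test of always-achievable for $\kappa(u) \ge v$ is based on a linear program $\mathcal{L}_2(v)$, which is
defined over variables $X$ as in Eq. 9, and is obtained from $\mathcal{L}_1(m)$ by replacing constraints
(10c) with constraints (11c):

$$\mathcal{L}_2(v): \quad \max \xi \quad \text{subject to}$$

$$\forall i \in [k]: \begin{cases} d(\alpha_i(s_0)) = 0, \\ d(s) \le d(s') + c[i](o), & \forall (s', o, s) \in G_i, \\ b[i] \le d(s), & \forall s \in G_i \text{ s.t. } u[i](s) = \sigma_i, \end{cases} \tag{11a}$$

$$\forall o \in O: \begin{cases} c[i](o) \ge 0, & \forall i \in [k], \\ \sum_{i \in [k]} c[i](o) \le c(o), \end{cases} \tag{11b}$$

$$\forall Z \subseteq [k] \text{ s.t. } \sum_{i \in Z} \sigma_i \ge v: \quad \xi \le \sum_{i \in Z} b[i]. \tag{11c}$$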


While the semantics of all variables but $\xi$ remains as in $\mathcal{L}_1(m)$, $\xi$ now captures the
minimal total cost of reaching some states $s_1, \dots, s_k$ in the abstract models such
that the total value $\sum_{i \in [k]} u[i](s_i) \ge v$. The new constraint (11c) enforces this semantics of
$\xi$.

Lemma 8 For any $\epsilon > 0$, the algorithm in Figure 9 terminates in time polynomial in $\|\Pi\|$,
$\|\mathcal{AS}\|$, $\log \frac{1}{\epsilon}$, and a unary representation of the budget $b$ of $\Pi$.

Proof: The number of iterations of the while-loop is approximately $\log_2 \frac{\sum_{i \in [k]} \sigma_i}{\epsilon}$, and the
run-time of each of its iterations boils down to the complexity of solving $\mathcal{L}_2(v)$. Similarly
to what we had in Lemma 5 with linear programs $\mathcal{L}_1(m)$, while the number of variables in
$\mathcal{L}_2(v)$, as well as the number of constraints in (11a) and (11b), are polynomial in $\|\Pi\|$ and
$\|\mathcal{AS}\|$, the number of constraints in (11c) is $\Theta(2^k)$. Therefore, solve($\mathcal{L}_2(v)$) also employs
the ellipsoid method with a sub-routine for the associated separation problem. We now
show how that separation problem for $\mathcal{L}_2(v)$ can be solved in pseudo-polynomial time using
a standard pseudo-polynomial procedure for the strict Knapsack problem.

Given an assignment $x \in dom(X)$, its feasibility with respect to (11a) and (11b) can be
tested directly by substitution. For constraints (11c), let $Z \subseteq [k]$ be an optimal solution
to the strict Knapsack problem $\langle \{x[b[i]], \sigma_i\}_{i \in [k]}, x[\xi] \rangle$, with a weight allowance $x[\xi]$ and $k$
objects, with each object $i \in [k]$ being associated with weight $x[b[i]]$ and value $\sigma_i$.

• If the value $\sum_{i \in Z} \sigma_i$ of $Z$ is smaller than $v$, then $x$ satisfies all the constraints in
(11c). Assume to the contrary that $x$ violates some constraint in (11c), corresponding
to a set $Z' \subseteq [k]$. By definition of (11c), $\sum_{i \in Z'} \sigma_i \ge v$, and by our assumption,
$x[\xi] > \sum_{i \in Z'} x[b[i]]$. That, however, implies that $Z'$ is a feasible solution for our
strict Knapsack, and of value higher than that of presumably optimal $Z$.

• Otherwise, if $\sum_{i \in Z} \sigma_i \ge v$, then $Z$ itself provides us with a constraint in (11c) that is
violated by $x$. This is because $x[\xi] > \sum_{i \in Z} x[b[i]]$ holds by the virtue of $Z$ being a
solution to the strict Knapsack problem $\langle \{x[b[i]], \sigma_i\}_{i \in [k]}, x[\xi] \rangle$.

□ 
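
The separation step in the proof above can be illustrated with a small Python sketch. It assumes, purely for the sake of a simple dynamic program, that the weights are non-negative integers; the function names `strict_knapsack` and `find_violated_constraint` are hypothetical and not part of the original algorithm description.

```python
import math


def strict_knapsack(weights, values, allowance):
    """Max-value subset whose total weight is strictly below `allowance`.

    Assumes non-negative integer weights (an assumption of this sketch), so
    that "sum < allowance" is equivalent to "sum <= ceil(allowance) - 1".
    """
    cap = math.ceil(allowance) - 1
    if cap < 0:
        return 0, set()
    n = len(weights)
    # best[j][w]: best value using the first j objects within total weight w
    best = [[0] * (cap + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        wj, vj = weights[j - 1], values[j - 1]
        for w in range(cap + 1):
            best[j][w] = best[j - 1][w]
            if wj <= w and best[j - 1][w - wj] + vj > best[j][w]:
                best[j][w] = best[j - 1][w - wj] + vj
    # Trace back which objects were selected.
    chosen, w = set(), cap
    for j in range(n, 0, -1):
        if best[j][w] != best[j - 1][w]:
            chosen.add(j - 1)
            w -= weights[j - 1]
    return best[n][cap], chosen


def find_violated_constraint(b_vals, sigmas, xi, v):
    """Return a set Z whose constraint in (11c) is violated, or None."""
    value, Z = strict_knapsack(b_vals, sigmas, xi)
    return None if value < v else Z


no_violation = find_violated_constraint([3, 4, 5], [2, 3, 4], xi=8, v=6)
violation = find_violated_constraint([3, 4, 5], [2, 3, 4], xi=8, v=5)
```

In the second call, objects 0 and 1 have total weight 7 < 8 and total value 5 >= 5, so the set {0, 1} is returned as a violated constraint, matching the second case of the proof.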


Lemma 9 For any $0 < \epsilon < \min_{i \in [k]} \sigma_i$, the algorithm in Figure 9a computes $\kappa_\epsilon$ such that
$\kappa_\epsilon - \kappa(u) \le \epsilon$.

The proof of Lemma 9 appears in Appendix A, p. 47. Putting Lemmas 8 and 9 together,
Theorem 10 summarizes our result for optimized abstraction discovery in $A_p(-, u, -)$ for
general 0-binary value partitions $u$. Importantly, note that the algorithm in Figure 9
depends on the unary representation of only the budget, and not of the possible state values.
In particular, it means that dependence of the complexity on the number of alternative
sub-goals in the OSP task of interest is only polynomial. Finally, the statement of Theorem 10
involves the precision of the estimate only because the $\sigma_i$ values of the abstract value
functions $u[i]$ can be arbitrary real numbers. In the case of integer-valued sets of functions
$u$, as well as in various special cases of real-valued functions, $\kappa(u)$ can be determined
precisely using a simplification of the algorithm in Figure 9. For instance, if all $\sigma_1, \dots, \sigma_k$
are integers, then setting $\epsilon$ to any value in $(0, 1)$ results in the while-loop terminating with
$\alpha = \kappa(u)$. These details, however, are more of a theoretical interest; for reasonably small
values of $\epsilon$, in practice there will be no difference between estimates $h(s, b)$ and $h(s, b) + \epsilon$.

Theorem  10  (Ap(— ,  u,  —  )(s)  0-binary  u) 

Given an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, a homomorphic explicit abstraction skeleton
$\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ of $M_\Pi$, a 0-binary value partition $u \in U_p$, and $\epsilon > 0$,
approximating $\kappa(u)$ within an additive factor of $\epsilon$ can be done in time polynomial in $\|\Pi\|$,
$\|\mathcal{AS}\|$, $\log \frac{1}{\epsilon}$, and a unary representation of the budget $b$ of $\Pi$.

4.2  General  Value  Partitions 

While 0-binary value partitions can be rather useful by themselves, it turns out that the
pseudo-polynomial algorithms for abstraction discovery with explicit homomorphic abstraction
skeletons and 0-binary value partitions can be extended rather easily to arbitrary value
partitions. All it takes is to notice that

(1) For any OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, any homomorphic abstraction skeleton
$\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ of $M_\Pi$, and any value partition $u$ over $\mathcal{AS}$, the number of
distinct values taken by $u[i]$ is trivially upper-bounded by the number of states in $G_i$;
and

(2) The pseudo-polynomial solvability of the Knapsack problem extends to its more general
variant known as Multiple-Choice Knapsack (Dudzinski & Walukiewicz, 1987; Kellerer
et al., 2004).

The Multiple-Choice (MC) Knapsack problem $\langle N_1, \dots, N_m; W \rangle$ is given by a weight
allowance $W$ and $m$ classes of objects $N_1, \dots, N_m$, with each object $j \in N_i$ being annotated
with a weight $w_{ij}$ and a value $\sigma_{ij}$. The objective is to find a set $Z$ that contains at most
one object from each class and maximizes $\sum_{(i,j) \in Z} \sigma_{ij}$ over such sets while satisfying
$\sum_{(i,j) \in Z} w_{ij} \le W$. By strict MC-Knapsack we refer to a variant of MC-Knapsack in
which that inequality constraint is strict. MC-Knapsack generalizes regular Knapsack and
thus it is NP-hard. However, similarly to the regular Knapsack, MC-Knapsack also admits
a pseudo-polynomial, dynamic programming algorithm that runs in time polynomial in the
description of the problem and in the unary representation of $W$ (Dudzinski & Walukiewicz,
1987; Kellerer et al., 2004).
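
As an illustration of the pseudo-polynomial behaviour mentioned above, the following Python sketch solves the (non-strict) MC-Knapsack problem by dynamic programming over the weight allowance. It assumes non-negative integer weights and an integer allowance; the function name `mc_knapsack` and the input layout are hypothetical choices made for this sketch, not taken from the cited references.

```python
def mc_knapsack(classes, allowance):
    """Multiple-Choice Knapsack by dynamic programming.

    classes: list of classes, each a list of (weight, value) pairs.
    Picks at most one object per class with total weight <= allowance.
    Assumes non-negative integer weights and an integer allowance.
    Returns (best value, {class index: chosen object index}).
    """
    cap = int(allowance)
    best = [0.0] * (cap + 1)   # best[w]: best value within total weight w
    choice = []                # choice[i][w]: object picked from class i, or None
    for objs in classes:
        new = best[:]          # option: pick nothing from this class
        pick = [None] * (cap + 1)
        for w in range(cap + 1):
            for j, (wt, val) in enumerate(objs):
                if wt <= w and best[w - wt] + val > new[w]:
                    new[w] = best[w - wt] + val
                    pick[w] = j
        best = new
        choice.append(pick)
    # Trace back the selected object of each class.
    picks, w = {}, cap
    for i in range(len(classes) - 1, -1, -1):
        j = choice[i][w]
        if j is not None:
            picks[i] = j
            w -= classes[i][j][0]
    return best[cap], picks


value, picks = mc_knapsack([[(2, 3), (4, 5)], [(3, 4), (5, 7)]], 7)
```

The running time is proportional to the allowance times the total number of objects, which is polynomial only in the unary representation of the allowance.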

Theorem  11  (Ap(c,u,  — )) 

Let $\Pi = \langle V, s_0, u; O, c, b \rangle$ be an OSP task, $\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ be an explicit
homomorphic abstraction skeleton of $M_\Pi$, and $u \in U_p$ be an arbitrary value partition over
$\mathcal{AS}$. Given a cost partition $c \in C_p$, finding an abstraction $(c, u, b) \in A_p(c, u, -)$ and
computing the corresponding heuristic estimate $h_{M_{(c,u,b)}}(s_0, b)$ can be done in time polynomial
in $\|\Pi\|$, $\|\mathcal{AS}\|$, and the unary representation of the budget $b$.

Proof: The proof is very similar to the proof of Theorem 3, but with the compilation being
to the MC-Knapsack problem.

For $i \in [k]$, let the range of $u[i]$ be $\{\sigma_{i1}, \dots, \sigma_{in_i}\} \subset \mathbb{R}^+$, and, for $j \in [n_i]$, let $w_{ij}$ be the cost of the
cheapest path in $G_i$ from $\alpha_i(s_0)$ to (one of the) states $s \in S_i$ with $u[i](s) = \sigma_{ij}$. Since
$\mathcal{AS}$ is an explicit abstraction skeleton, for $i \in [k]$, $n_i \le |S_i|$, and the set $\{w_{ij}\}_{i \in [k], j \in [n_i]}$
can be computed in time polynomial in $\|\mathcal{AS}\|$ using one of the standard algorithms for the
single-source shortest paths problem.

Consider now an MC-Knapsack problem with a weight allowance $b$ and $k$ classes of
objects $N_1, \dots, N_k$, with $|N_i| = n_i$ and each object $j \in N_i$ being annotated with a weight
$w_{ij}$ and a value $\sigma_{ij}$. Let $Z \subseteq \bigcup_{i \in [k]} N_i$ be a solution to that (optimization) MC-Knapsack
problem; recall that it is computable in pseudo-polynomial time. Given that, we define
budget profile $b^* \in B$ as follows:

$$\text{for } i \in [k], \quad b^*[i] = \begin{cases} w_{ij}, & (i, j) \in Z, \\ 0, & \text{otherwise.} \end{cases}$$

Showing that $(c, u, b^*)$ actually induces an additive abstraction for $M_\Pi$ is completely identical
to the proof of the corresponding argument in Theorem 3, and thus omitted. □
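
The two ingredients of this proof, the cheapest-path weights and the resulting budget profile, can be sketched as follows. The graph encoding, the function names `cheapest_cost_per_value` and `budget_profile`, and the use of Dijkstra's algorithm are assumptions of this sketch; the proof only requires some standard single-source shortest-paths procedure.

```python
import heapq


def cheapest_cost_per_value(edges, init, value):
    """Cheapest path cost from `init` to some state of each reachable value.

    edges: dict state -> list of (successor, non-negative edge cost)
    value: dict state -> abstract value of that state
    Returns dict: value sigma -> cost of the cheapest path to a state with it.
    """
    dist = {init: 0}
    heap = [(0, init)]
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist.get(s, float("inf")):
            continue  # stale queue entry
        for t, c in edges.get(s, []):
            nd = d + c
            if nd < dist.get(t, float("inf")):
                dist[t] = nd
                heapq.heappush(heap, (nd, t))
    w = {}
    for s, d in dist.items():
        w[value[s]] = min(d, w.get(value[s], float("inf")))
    return w


def budget_profile(w, picks):
    """b*[i] = w[i][sigma] if value sigma was selected for class i, else 0."""
    return [w[i][picks[i]] if i in picks else 0 for i in range(len(w))]


edges = {"s0": [("s1", 2), ("s2", 5)], "s1": [("s2", 1)]}
value = {"s0": 0, "s1": 0, "s2": 3}
weights = cheapest_cost_per_value(edges, "s0", value)
```

Here `picks` stands for a solution of the MC-Knapsack problem, given as a mapping from class index to the selected value.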


Theorem  12  (Ap(— ,  u,  — )) 

Given an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, a homomorphic explicit abstraction skeleton
$\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ of $M_\Pi$, an arbitrary value partition $u \in U_p$ over $\mathcal{AS}$, and $\epsilon > 0$,
approximating $\kappa(u)$ within an additive factor of $\epsilon$ can be done in time polynomial in $\|\Pi\|$,
$\|\mathcal{AS}\|$, $\log \frac{1}{\epsilon}$, and a unary representation of the budget $b$ of $\Pi$.

An algorithm for abstraction discovery as in Theorem 12 is depicted in Figure 10. Its
high-level flow differs from the flow of the algorithm from Figure 9 for general 0-binary value
partitions only in the initialization of parameters $\epsilon$ and $\beta$. The major difference between
the algorithms is that here the tests of candidate values $v$ are delegated to a different
input: $\Pi = \langle V, s_0, u; O, c, b \rangle$, $\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ of $M_\Pi$,
       value partition $u \in U_p$
output: $\kappa(u)$

for $i = 1$ to $k$ do
    reduce $G_i$ to only nodes reachable from $\alpha_i(s_0)$
let $0 < \epsilon < \min_{i \in [k]} \min_{j \in [n_i]} \sigma_{ij}$
$\alpha \leftarrow 0$
$\beta \leftarrow \sum_{i \in [k]} \max_{j \in [n_i]} \sigma_{ij}$
while $\beta - \alpha > \epsilon$ do
    $v \leftarrow \alpha + (\beta - \alpha)/2$
    if always-achievable($v$) then $\alpha \leftarrow v$
    else $\beta \leftarrow v$
if $\alpha = 0$ then return 0
else return $\beta$

always-achievable($v$):
    solve($\mathcal{L}_3(v)$) $\mapsto$ solution $x \in dom(X)$
    if $x[\xi] \le b$ then return true
    else return false

(a)

solve($\mathcal{L}_3(v)$):
    set (13c') to an arbitrary subset of constraints (13c)
    loop
        set $\mathcal{L}'_3(v)$ to $\mathcal{L}_3(v)$, with constraints (13c') instead of (13c)
        ellipsoid-method($\mathcal{L}'_3(v)$) $\mapsto$ solution $x \in dom(X)$
        strict-MC-Knapsack($\langle \{x[b[1, j]], \sigma_{1j}\}_{j \in [n_1]}, \dots, \{x[b[k, j]], \sigma_{kj}\}_{j \in [n_k]}; x[\xi] \rangle$)
            $\mapsto$ solution $Z \in [n_1] \times \cdots \times [n_k]$
        if $\sum_{i \in [k]} \sigma_{iZ(i)} < v$ then return $x$
        extend (13c') with constraint $\xi \le \sum_{i \in [k]} b[i, Z(i)]$

(b)

Figure 10: (a) A modification of the algorithm from Figure 9 to arbitrary value partitions
$u \in U_p$ (Theorem 12), and (b) the respective solve sub-routine that is based on
linear programs $\mathcal{L}_3(v)$ in Eq. 13


solve sub-routine (Figure 10b) that is based on linear programs $\mathcal{L}_3(v)$, which are defined as
follows.
For $i \in [k]$, let the range of $u[i]$ be $\{\sigma_{i1}, \dots, \sigma_{in_i}\} \subset \mathbb{R}^+$. For $v \in \mathbb{R}^+$, the linear program $\mathcal{L}_3(v)$ is
defined in Eq. 13 over variables

$$X = \{\xi\} \cup \bigcup_{i \in [k]} \Big[ \{d(s)\}_{s \in G_i} \cup \bigcup_{j \in [n_i]} \{b[i, j]\} \cup \bigcup_{o \in O} \{c[i](o)\} \Big] \tag{12}$$

These variables differ from the variable set of $\mathcal{L}_2(v)$ (see Eq. 9) by a larger set of b-variables:
Variable $b[i, j]$ here captures the minimal budget needed for reaching in $G_i$ a state with
value $\sigma_{ij}$ from state $\alpha_i(s_0)$, given that the edges of $G_i$ are weighted consistently with the
variable vector $c[i]$.


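$$\mathcal{L}_3(v): \quad \max \xi \quad \text{subject to}$$

$$\forall i \in [k]: \begin{cases} d(\alpha_i(s_0)) = 0, \\ d(s) \le d(s') + c[i](o), & \forall (s', o, s) \in G_i, \\ b[i, j] \le d(s), & \forall j \in [n_i]\ \forall s \in G_i \text{ s.t. } u[i](s) = \sigma_{ij}, \end{cases} \tag{13a}$$

$$\forall o \in O: \begin{cases} c[i](o) \ge 0, & \forall i \in [k], \\ \sum_{i \in [k]} c[i](o) \le c(o), \end{cases} \tag{13b}$$

$$\forall Z \in [n_1] \times \cdots \times [n_k] \text{ s.t. } \sum_{i \in [k]} \sigma_{iZ(i)} \ge v: \quad \xi \le \sum_{i \in [k]} b[i, Z(i)]. \tag{13c}$$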


Similarly to what we had in Lemma 8 with linear programs $\mathcal{L}_2(v)$, while the number of
variables in $\mathcal{L}_3(v)$, as well as the number of constraints in (13a) and (13b), are polynomial in
$\|\Pi\|$ and $\|\mathcal{AS}\|$, the number of constraints in (13c) is $\Theta(d^k)$ where $d = \max_{i \in [k]} n_i$. Therefore,
solve($\mathcal{L}_3(v)$) also employs the ellipsoid method with a pseudo-polynomial separation
problem, but here it is based on strict MC-Knapsack. Otherwise, solving $\mathcal{L}_2(v)$ and solving
$\mathcal{L}_3(v)$ are similar.

Lemma 13 For any $\epsilon > 0$, the algorithm in Figure 10 terminates in time polynomial in
$\|\Pi\|$, $\|\mathcal{AS}\|$, $\log \frac{1}{\epsilon}$, and a unary representation of the budget $b$ of $\Pi$.

Lemma 14 Given an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, a homomorphic explicit abstraction
skeleton $\mathcal{AS} = \{(G_1, \alpha_1), \dots, (G_k, \alpha_k)\}$ of $M_\Pi$, an arbitrary value partition $u \in U_p$ over
$\mathcal{AS}$, and $\epsilon > 0$, the algorithm in Figure 10 computes $\kappa_\epsilon$ such that $\kappa_\epsilon - \kappa(u) \le \epsilon$.

The proof of Lemma 13 is similar to the proof of Lemma 8, with strict Knapsack
separation problems being replaced with strict MC-Knapsack separation problems. The
proof of Lemma 14 is also similar to the proof of Lemma 9, mutatis mutandis. Together,
Lemmas 13 and 14 establish Theorem 12.
5. Landmarks in OSP

In addition to state-space abstractions, a family of approximation techniques that have been
found extremely effective in the context of optimal classical planning is based on the notion
of logical landmarks for goal reachability (Karpas & Domshlak, 2009; Helmert & Domshlak,
2009; Domshlak et al., 2012; Bonet & Helmert, 2010b; Pommerening & Helmert, 2013).
In this section we proceed with examining the prospects of such reachability landmarks in
heuristic-search OSP planning.

5.1  Landmarks  in  Classical  Planning 

For a state $s$ in a classical planning task $\Pi$, a landmark is a property of operator sequences
that is satisfied by all $s$-plans (Hoffmann, Porteous, & Sebastia, 2004). For instance, a
“fact landmark” for a state $s$ is an assignment to a single variable that is true at some
point in every $s$-plan. Most state-of-the-art admissible heuristics for classical planning use
what is called disjunctive action landmarks, each corresponding to a set of operators
such that every $s$-plan contains at least one operator from that set (Karpas & Domshlak,
2009; Helmert & Domshlak, 2009; Bonet & Helmert, 2010a; Pommerening & Helmert,
2013). In what follows we consider this popular notion of landmarks, and simply refer to
disjunctive action landmarks for a state $s$ as $s$-landmarks. For ease of presentation, most of
our discussion will take place in the context of landmarks for the initial state of the task,
and these will simply be referred to as landmarks (for $\Pi$).

Deciding whether an operator set $L \subseteq O$ is a landmark for classical planning task $\Pi$ is
PSPACE-hard (Porteous, Sebastia, & Hoffmann, 2001). Therefore, all landmark heuristics
employ methods for landmark discovery that are polynomial-time, sound, but incomplete.
In what follows we assume access to such a procedure; the actual way the landmarks are
discovered is tangential to our contribution. For a set $\mathcal{L}$ of $s$-landmarks, a landmark cost
function $lcost: \mathcal{L} \to \mathbb{R}^{0+}$ is admissible if $\sum_{L \in \mathcal{L}} lcost(L) \le h^*(s)$. For a singleton set
$\mathcal{L} = \{L\}$, $lcost(L) := \min_{o \in L} c(o)$ is a natural admissible landmark cost function, and it
extends directly to non-singleton sets of pairwise disjoint landmarks. For more general sets
of landmarks, $lcost$ can be devised in polynomial time via operator cost partitioning (Katz
& Domshlak, 2010b), either given $\mathcal{L}$ (Karpas & Domshlak, 2009), or within the actual
process of generating $\mathcal{L}$ (Helmert & Domshlak, 2009).
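
For the pairwise disjoint case, the natural cost function mentioned above is simple enough to sketch directly. The function name `disjoint_landmark_costs` and the dictionary-based encoding of operator costs are assumptions of this sketch.

```python
def disjoint_landmark_costs(landmarks, cost):
    """lcost(L) = min over o in L of c(o), for pairwise disjoint landmarks.

    landmarks: list of sets of operator names
    cost: dict operator name -> non-negative cost
    Raises ValueError if two landmarks share an operator, since the simple
    min-cost rule is only claimed for pairwise disjoint landmarks.
    """
    seen = set()
    for landmark in landmarks:
        if seen & landmark:
            raise ValueError("landmarks must be pairwise disjoint")
        seen |= landmark
    return [min(cost[o] for o in landmark) for landmark in landmarks]


costs = disjoint_landmark_costs([{"a", "b"}, {"c"}], {"a": 3, "b": 2, "c": 5})
```

Because every plan must use at least one operator from each landmark and the landmarks share no operators, the sum of these minima cannot exceed the cost of any plan.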

5.2 $\epsilon$-Landmarks and Budget Reduction

While landmarks play an important role in (both satisficing and optimal) classical planning,
so far they have not been exploited in OSP. At first glance, this is probably no surprise,
and not only because OSP has been investigated much less than classical planning: Since
landmarks must be satisfied by all plans, and the empty operator sequence is always a plan
for any OSP task, the notion of landmark does not seem useful here. Having said that,
consider the anytime “output improvement” property of the BFBB forward search. The
empty plan is not interesting there not only because it is useless, but also because it is
“found” by the search algorithm right at the get-go. In general, at all stages of the search,
anytime algorithms like BFBB maintain the best-so-far solution $\pi$, and prune all branches
that promise value lower or equal to $Q_b(\pi)$. Hence, in principle, such algorithms may benefit
from information about properties that are “satisfied by all plans with value larger than
$Q_b(\pi)$.” Polynomial-time discovery of such “value landmarks” for arbitrary OSP tasks is
still an open problem. However, looking at what is needed and what is available, here we
show that the classical planning machinery of reachability landmarks actually can be effectively
exploited in OSP.

In what follows, we assume that the value function of $\Pi$ is additive, with $u(s) =
\sum_{\langle v/d \rangle \in s} u_v(d)$, with $u_v(d) \ge 0$ for all variable-value pairs $\langle v/d \rangle$. That is, the value of state
$s$ is the sum of the (mutually independent) non-negative marginal values of the propositions
comprising $s$. With the value of different $s$-plans in an OSP task $\Pi$ varying between zero
and the value of the optimal $s$-plan (which may also be zero), let an $\epsilon$-landmark for state
$s$ be any property that is satisfied by any $s$-plan $\pi$ that achieves something valuable. For
instance, with the disjunctive action landmarks we use here, if $L \subseteq O$ is an $\epsilon$-landmark for
$s$, then every $s$-plan $\pi$ with $Q_b(\pi) > 0$ contains an operator from $L$. In what follows, unless
stated otherwise, we focus on $\epsilon$-landmarks for (the initial state of) $\Pi$.

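Definition 5 Given an OSP task $\Pi = \langle V, s_0, u; O, c, b \rangle$, the $\epsilon$-compilation of $\Pi$ is a classical
planning task $\Pi_\epsilon = \langle V_\epsilon, s_{0\epsilon}, G_\epsilon; O_\epsilon, c_\epsilon \rangle$ where

$V_\epsilon = V \cup \{g\}$, with $dom(g) = \{0, 1\}$,

$s_{0\epsilon} = s_0 \cup \{\langle g/0 \rangle\}$,

$G_\epsilon = \{\langle g/1 \rangle\}$,

$O_\epsilon = O \cup O_g = O \cup \{o_{\langle v/d \rangle} \mid \langle v/d \rangle \in \mathcal{D},\ u_v(d) > 0\}$,
with $pre(o_{\langle v/d \rangle}) = \{\langle v/d \rangle\}$ and $eff(o_{\langle v/d \rangle}) = \{\langle g/1 \rangle\}$,

$$c_\epsilon(\omega) = \begin{cases} c(\omega), & \omega = o \in O, \\ 0, & \omega = o_{\langle v/d \rangle} \in O_g. \end{cases}$$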

In plain words, $\Pi_\epsilon$ extends the structure of $\Pi$ with a set of zero-cost actions such that
applying any of them indicates achieving a positive value in $\Pi$. Constructing $\Pi_\epsilon$ from $\Pi$ is
trivially polynomial time, and it allows us to discover $\epsilon$-landmarks for $\Pi$ using the standard
machinery for classical planning landmark discovery.
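
The construction in Definition 5 can be sketched over a simple dictionary encoding of a task. The encoding, the function name `epsilon_compilation`, the auxiliary variable name `"g"` (assumed not to clash with an existing variable), and the naming scheme for the new operators are all hypothetical choices made for this illustration.

```python
def epsilon_compilation(task):
    """Build a classical task whose goal is 'some positive value was achieved'.

    task: dict with keys
      'variables': var -> list of domain values
      'init':      var -> initial value
      'operators': name -> {'pre': dict, 'eff': dict, 'cost': number}
      'value':     (var, val) -> non-negative marginal value
    """
    variables = dict(task["variables"])
    variables["g"] = [0, 1]                  # fresh goal variable (assumed unused)
    init = dict(task["init"])
    init["g"] = 0
    operators = {name: dict(op) for name, op in task["operators"].items()}
    for (v, d), marginal in task["value"].items():
        if marginal > 0:
            # zero-cost operator that certifies that a valuable fact holds
            operators[f"o_<{v}/{d}>"] = {"pre": {v: d}, "eff": {"g": 1}, "cost": 0}
    return {"variables": variables, "init": init,
            "goal": {"g": 1}, "operators": operators}


toy = {
    "variables": {"x": [0, 1]},
    "init": {"x": 0},
    "operators": {"set": {"pre": {"x": 0}, "eff": {"x": 1}, "cost": 2}},
    "value": {("x", 1): 5, ("x", 0): 0},
}
compiled = epsilon_compilation(toy)
```

Any off-the-shelf landmark discovery procedure for classical planning could then be run on the compiled task; that step is outside the scope of this sketch.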

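Theorem 15 For any OSP task $\Pi$, any landmark $L$ for $\Pi_\epsilon$ such that $L \subseteq O$ is an
$\epsilon$-landmark for $\Pi$.

Proof: The proof is rather straightforward. Let $\mathcal{P}$ be the set of all plans $\pi$ for $\Pi$ with
$Q_b(\pi) > 0$ and $\mathcal{P}_\epsilon$ the set of all plans for $\Pi_\epsilon$. By the definition of $\mathcal{P}$, for any plan $\pi \in \mathcal{P}$,
there exists a proposition $\langle v/d \rangle \in \mathcal{D}$ such that $u_v(d) > 0$ and $\langle v/d \rangle \in s_0[\![\pi]\!]$. Likewise, since
$s_{0\epsilon} := s_0 \cup \{\langle g/0 \rangle\}$ and $O_\epsilon \supseteq O$, $\pi$ is applicable in $s_{0\epsilon}$. Hence, by definition of $o_{\langle v/d \rangle} \in O_\epsilon$,
$\pi \cdot \langle o_{\langle v/d \rangle} \rangle$ is applicable in $s_{0\epsilon}$ and $\langle g/1 \rangle \in s_{0\epsilon}[\![\pi \cdot \langle o_{\langle v/d \rangle} \rangle]\!]$, that is, $\pi \cdot \langle o_{\langle v/d \rangle} \rangle \in \mathcal{P}_\epsilon$. In turn,
if $L$ is a landmark for $\Pi_\epsilon$, then $\pi \cdot \langle o_{\langle v/d \rangle} \rangle$ contains an operator from $L$, and if $L \subseteq O$, then
$\pi$ contains an operator from $L$ as well. That proves that all landmarks $L$ for $\Pi_\epsilon$ over the
operators $O$ of $\Pi$ are $\epsilon$-landmarks for $\Pi$. □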
With Theorem 15 in hand, we can now derive $\epsilon$-landmarks for $\Pi$ using any method
for classical planning landmark extraction, such as that employed by the LAMA planner
(Richter et al., 2008) or the LM-Cut family of techniques (Helmert & Domshlak, 2009;
Bonet & Helmert, 2010a). However, at first glance, the discriminative power of knowing
“what is needed to achieve something valuable” seems to be negligible when it comes to
deriving effective heuristic estimates for OSP. The good news is that, in OSP, such information
can be effectively exploited in a slightly different way.

Consider a schematic example of searching for an optimal plan for an OSP task $\Pi$ with
budget $b$, using BFBB with an admissible heuristic $h$. Suppose that there is only one
sequence of (all unit-cost) operators, $\pi = \langle o_1, o_2, \dots, o_{b+1} \rangle$, applicable in the initial state
of $\Pi$, and that the only positive value state along $\pi$ is its end-state. While clearly no
value higher than zero can be achieved in $\Pi$ under the given budget of $b$, the search will
continue beyond the initial state, unless $h(s_0, \cdot)$ counts the cost of all the $b + 1$ operators
of $\pi$. Now, suppose that $h(s_0, \cdot)$ counts only the cost of $\{o_i, \dots, o_{b+1}\}$ for some $i > 0$, but
$\{o_1\}, \{o_2\}, \dots, \{o_{i-1}\}$ are all discovered to be $\epsilon$-landmarks for $\Pi$. Given that, suppose that
we modify $\Pi$ by (a) setting the cost of operators $o_1, o_2, \dots, o_{i-1}$ to zero, and (b) reducing
the budget to $b - i + 1$. Since all the operators $o_1, o_2, \dots, o_{i-1}$ anyway have to be applied
along any value collecting plan for $\Pi$, this modification seems to preserve the semantics of
$\Pi$. At the same time, on the modified task, BFBB with the same heuristic $h$ will prune the
initial state and thus establish without any search that the empty plan is an optimal plan
for $\Pi$. Of course, the way $\Pi$ is modified in this example is as simplistic as the example itself.
Yet, this example does motivate the idea of landmark-based budget reduction for OSP, as
well as illustrates the basic idea behind the generically sound task modifications that we
discuss next.

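Definition 6 Let $\Pi = \langle V, s_0, u; O, c, b \rangle$ be an OSP task, $\mathcal{L} = \{L_1, \dots, L_n\}$ be a set of
pairwise disjoint $\epsilon$-landmarks for $\Pi$, and $lcost$ be an admissible landmark cost function from
$\mathcal{L}$. The budget reducing compilation of $\Pi$ is an OSP task $\Pi_{\mathcal{L}} = \langle V_{\mathcal{L}}, s_{0\mathcal{L}}, u_{\mathcal{L}}; O_{\mathcal{L}}, c_{\mathcal{L}}, b_{\mathcal{L}} \rangle$
where

$$b_{\mathcal{L}} = b - \sum_{i=1}^{n} lcost(L_i) \tag{14}$$

and

$V_{\mathcal{L}} = V \cup \{v_{L_1}, \dots, v_{L_n}\}$, with $dom(v_{L_i}) = \{0, 1\}$,

$s_{0\mathcal{L}} = s_0 \cup \{\langle v_{L_1}/1 \rangle, \dots, \langle v_{L_n}/1 \rangle\}$,

$u_{\mathcal{L}} = u$,

$O_{\mathcal{L}} = O \cup \bigcup_{i=1}^{n} O_{L_i} = O \cup \bigcup_{i=1}^{n} \{\bar{o} \mid o \in L_i\}$,
with $pre(\bar{o}) = pre(o) \cup \{\langle v_{L_i}/1 \rangle\}$ and $eff(\bar{o}) = eff(o) \cup \{\langle v_{L_i}/0 \rangle\}$,

$$c_{\mathcal{L}}(\omega) = \begin{cases} c(o), & \omega = o \in O, \\ c(o) - lcost(L_i), & \omega = \bar{o} \in O_{L_i}. \end{cases}$$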
compile-and-BFBB ($\Pi = \langle V, s_0, u; O, c, b \rangle$)
    $\Pi_\epsilon$ := $\epsilon$-compilation of $\Pi$
    $\mathcal{L}$ := a set of landmarks for $\Pi_\epsilon$
    $lcost$ := admissible landmark cost function from $\mathcal{L}$
    $\Pi_{\mathcal{L}}$ := budget reducing compilation of $(\mathcal{L}, lcost)$ into $\Pi$
    $n^*$ := BFBB($\Pi_{\mathcal{L}}$)
    return plan for $\Pi$ associated with $n^*$


Figure 11: BFBB search with landmark-based budget reduction


In other words, $\Pi_{\mathcal{L}}$ extends the structure of $\Pi$ by

• mirroring the operators of each $\epsilon$-landmark $L_i$ with their “cheaper by $lcost(L_i)$” versions,

• using the “disposable” propositions $\langle v_{L_1}/1 \rangle, \dots, \langle v_{L_n}/1 \rangle$ to ensure that at most one
instance of these discounted operators for each $L_i$ can be applied along an operator
sequence from the initial state, and

• compensating for the discounted operators for $L_i$ by reducing the budget by precisely
$lcost(L_i)$.

This way, the transformation leads to effective equivalence between $\Pi$ and $\Pi_{\mathcal{L}}$.
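
A minimal sketch of the budget reducing compilation for pairwise disjoint landmarks is given below. The dictionary encoding of the task, the function name `budget_reducing_compilation`, and the naming of the auxiliary variables (`v_L1`, ...) and discounted operators (`<name>_bar`) are assumptions made for this illustration only.

```python
def budget_reducing_compilation(task, landmarks, lcost):
    """Compile pairwise disjoint landmarks into an OSP task (cf. Definition 6).

    task: dict with 'variables', 'init', 'value', 'operators', 'budget';
          each operator is {'pre': dict, 'eff': dict, 'cost': number}
    landmarks: list of pairwise disjoint sets of operator names
    lcost: list of landmark costs, aligned with `landmarks`
    """
    variables = dict(task["variables"])
    init = dict(task["init"])
    operators = {name: dict(op) for name, op in task["operators"].items()}
    for i, (landmark, lc) in enumerate(zip(landmarks, lcost), start=1):
        flag = f"v_L{i}"            # hypothetical fresh "disposable" variable
        variables[flag] = [0, 1]
        init[flag] = 1
        for name in sorted(landmark):
            op = task["operators"][name]
            # discounted copy: usable once, since it switches the flag off
            operators[f"{name}_bar"] = {
                "pre": {**op["pre"], flag: 1},
                "eff": {**op["eff"], flag: 0},
                "cost": op["cost"] - lc,
            }
    return {
        "variables": variables,
        "init": init,
        "value": task["value"],
        "operators": operators,
        "budget": task["budget"] - sum(lcost),
    }


toy = {
    "variables": {"x": [0, 1]},
    "init": {"x": 0},
    "value": {("x", 1): 5},
    "operators": {
        "a": {"pre": {"x": 0}, "eff": {"x": 1}, "cost": 3},
        "b": {"pre": {"x": 0}, "eff": {"x": 1}, "cost": 2},
    },
    "budget": 4,
}
reduced = budget_reducing_compilation(toy, [{"a", "b"}], [2])
```

In the toy task, the budget drops from 4 to 2 while the discounted copies of `a` and `b` cost 1 and 0, so exactly the same valuable plans remain affordable.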

Theorem 16 Let $\Pi = \langle V, s_0, u; O, c, b \rangle$ be an OSP task, $\mathcal{L}$ be a set of pairwise disjoint
$\epsilon$-landmarks for $\Pi$, $lcost$ be an admissible landmark cost function from $\mathcal{L}$, and $\Pi_{\mathcal{L}}$ be the
respective budget reducing compilation of $\Pi$. For every $\pi$ for $\Pi$ with $Q_b(\pi) > 0$, there is a
plan $\pi_{\mathcal{L}}$ for $\Pi_{\mathcal{L}}$ with $Q_{b_{\mathcal{L}}}(\pi_{\mathcal{L}}) = Q_b(\pi)$, and vice versa.

The proof of Theorem 16 appears in Appendix A, p. 48. The budget reducing OSP-to-OSP
compilation in Definition 6 is clearly polynomial time. Putting things together, the
compile-and-BFBB procedure depicted in Figure 11:

(1) generates an $\epsilon$-compilation $\Pi_\epsilon$ of $\Pi$,

(2) uses off-the-shelf tools for classical planning to generate a set of landmarks $\mathcal{L}$ for $\Pi_\epsilon$
and an admissible landmark cost function $lcost$, and

(3) compiles $(\mathcal{L}, lcost)$ into $\Pi$, obtaining an OSP task $\Pi_{\mathcal{L}}$.

The optimal solution for $\Pi_{\mathcal{L}}$ (and thus for $\Pi$) is then searched for using a search algorithm
for optimal OSP such as BFBB.

Before we proceed to consider more general sets of landmarks, a few comments concerning
the setup of Theorem 16 are in order. First, if the reduced budget $b_{\mathcal{L}}$
turns out to be lower than the cost of the cheapest action applicable in the initial state, then
obviously no search is needed, and the empty plan can be reported as optimal right away.
Second, zero-cost landmarks are useless in our compilation as much as they are useless in
deriving landmark heuristics for optimal planning. Hence, $lcost$ in what follows is assumed
to be strictly positive. Third, having both $o$ and $\bar{o}$ applicable at a state of $\Pi_{\mathcal{L}}$ brings no
benefits yet adds branching to the search. Hence, in our implementation, for each landmark
$L_i \in \mathcal{L}$ and each operator $o \in L_i$, the precondition of the regular operator $o$ in $O_{\mathcal{L}}$ is
extended with $\{\langle v_{L_i}/0 \rangle\}$. It is not hard to verify that this extension preserves the correctness
of $\Pi_{\mathcal{L}}$ in terms of Theorem 16. Finally, if the value of the initial state is not zero, that is,
the empty plan has some positive value, then the $\epsilon$-compilation $\Pi_\epsilon$ of $\Pi$ will have no positive
cost landmarks at all. However, this can easily be fixed by considering as “valuable” only
propositions $\langle v/d \rangle$ such that both $u_v(d) > 0$ and $\langle v/d \rangle \notin s_0$. For now we put this issue
aside and assume that $Q_b(\epsilon) = 0$. Later, however, we come back to consider this issue more
systematically.

5.3 Non-Disjoint $\epsilon$-Landmarks

While the budget reducing compilation $\Pi_{\mathcal{L}}$ above is sound for pairwise disjoint landmarks,
this is not so for more general sets of $\epsilon$-landmarks. For example, consider a planning task
$\Pi$ in which, for some operator $o$, we have $c(o) = b$, $Q_b(\langle o \rangle) > 0$, and $Q_b(\pi) = 0$ for all other
operator sequences $\pi \ne \langle o \rangle$. That is, a value greater than zero is achievable in $\Pi$, but only
via the operator $o$. Suppose now that our set of $\epsilon$-landmarks for $\Pi$ is $\mathcal{L} = \{L_1, \dots, L_n\}$,
$n > 1$, and that all of these $\epsilon$-landmarks contain $o$. In this case, while the budget in $\Pi_{\mathcal{L}}$
is $b_{\mathcal{L}} = b - \sum_{i=1}^{n} lcost(L_i)$, the cost of the cheapest replica $\bar{o}$ of $o$, that is, the cost of the
cheapest operator sequence achieving a non-zero value in $\Pi_{\mathcal{L}}$, is

$$c(o) - \max_{i=1}^{n} lcost(L_i) = b - \max_{i=1}^{n} lcost(L_i) > b - \sum_{i=1}^{n} lcost(L_i) = b_{\mathcal{L}}.$$

Hence, no state with positive value will be reachable from $s_{0\mathcal{L}}$ in $\Pi_{\mathcal{L}}$, and thus $\Pi$ and $\Pi_{\mathcal{L}}$
are not “value equivalent” in the sense of Theorem 16.
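
The mismatch in this example can be checked numerically. The sketch below uses made-up numbers (budget 10, two landmarks with costs 3 and 4, both containing the single useful operator) and a hypothetical helper name `independent_compilation_gap`; it simply compares the reduced budget with the cost of the cheapest discounted replica when each landmark is compiled independently.

```python
def independent_compilation_gap(c_o, budget, lcosts):
    """Reduced budget vs. cheapest replica of o under independent compilation.

    c_o: cost of the only valuable operator o
    budget: original budget b
    lcosts: costs lcost(L_i) of the landmarks that all contain o
    """
    reduced_budget = budget - sum(lcosts)
    replica_costs = [c_o - lc for lc in lcosts]  # one replica per landmark
    return reduced_budget, min(replica_costs)


reduced_budget, cheapest_replica = independent_compilation_gap(
    c_o=10, budget=10, lcosts=[3, 4]
)
```

With these numbers the reduced budget is 3 while the cheapest replica still costs 6, so no valuable state is affordable in the compiled task although one is affordable in the original task.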

This example shows why compiling non-disjoint $\epsilon$-landmarks into $\Pi$ independently is not
sound. In principle, this can be repaired as follows. Let $\Pi = \langle V, s_0, u; O, c, b \rangle$ be an OSP
task, $\mathcal{L} = \{L_1, \dots, L_n\}$ be a set of $\epsilon$-landmarks for $\Pi$, and $lcost$ be an admissible landmark
cost function from $\mathcal{L}$. All the components in $\Pi_{\mathcal{L}} = \langle V_{\mathcal{L}}, s_{0\mathcal{L}}, u_{\mathcal{L}}; O_{\mathcal{L}}, c_{\mathcal{L}}, b_{\mathcal{L}} \rangle$ are still defined
as in Definition 6, except for the operator sets $O_{L_1}, \dots, O_{L_n}$. The latter are now constructed
not independently of each other, but sequentially, with the content of $O_{L_i}$ depending on
the content of all $O_{L_j}$, $j < i$. The ordering in which the sets $O_{L_i}$ are constructed can be
arbitrary.

For each operator $o \in O$ and each $1 \le i \le n$, let $O_{o;i}$ denote the set of all “cost
discounted” representatives of $o$ introduced during the construction of $O_{L_1}, \dots, O_{L_i}$. Given
that, for $1 \le i \le n$, if for some operator $\bar{o} \in \bigcup_{o' \in L_i} O_{o';i-1}$, we have $c_{\mathcal{L}}(\bar{o}) = 0$, then
$O_{L_i} := \emptyset$. Otherwise, $O_{L_i}$ contains an operator $\bar{o}$ for each operator

$$o \in L_i \cup \bigcup_{o' \in L_i} O_{o';i-1}, \tag{15}$$
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with  o  being  defined  very  similarly  to  Definition  6  as: 


pre(o) 

eff(o) 

cc(o) 


pre(o)  U  {(vLi/l)}, 
eff(o)  U  {(^Lj/O)}, 

{' c(o )  —  lcost(Li),  o  £  Li, 
cc(o)  ~  lcost(Li),  o  £  Uo'eLi  1 


(16) 


The compilation extended this way is sound for arbitrary sets of $\varepsilon$-landmarks, and
on pairwise disjoint landmarks it reduces to the basic compilation used in Theorem 16.
In general, however, this extended compilation is no longer polynomial in the size of the
explicit representation of $\Pi$ because

$$|O_{o;i}| = 2^{|\{L_j \,\mid\, j \le i,\ o \in L_j\}|}.$$

For example, let $\mathcal{L} = \{L_1, L_2, L_3\}$, $L_1 = \{a, b\}$, $L_2 = \{a, c\}$, $L_3 = \{a, d\}$. Generation of
$O_{L_1} := \{a_1, b_1\}$ effectively follows Definition 6, but for $O_{L_2}$, the base set of operators as in
Eq. 15 is already $\{a, c, a_1\}$. Thus, $O_{L_2} := \{a_2, c_1, a_3\}$, where, for $i \in \{2,3\}$ and denoting $a$
by $a_0$, $a_i$ is derived according to Eq. 16 from $a_{i-2}$. Consequently, the base set of operators
for $O_{L_3}$ is $\{a, d, a_1, a_2, a_3\}$, resulting in $O_{L_3} = \{a_4, d_1, a_5, a_6, a_7\}$, where, for $i \in \{4,5,6,7\}$,
$a_i$ is derived from $a_{i-4}$. In sum, $\Pi_{\mathcal{L}}$ ends up with $8 = 2^3$ representatives of the operator $a$.

Since non-disjoint landmarks can bring more information, and they are typical of the outputs
of standard techniques for landmark extraction in classical planning, we now present a
different, slightly more involved, compilation that is both polynomial and sound for arbitrary
sets of $\varepsilon$-landmarks.
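The doubling in this example can be replayed with a few lines of code. This is a minimal sketch under simplifying assumptions: the function name is hypothetical, landmarks are given as ordered tuples of operator names, and the special case in which a zero-cost representative empties $O_{L_i}$ is ignored. It counts, for each operator, the original plus all discounted representatives created by the sequential construction.

```python
def count_representatives(landmarks):
    """Count copies of each operator (the original plus its discounted representatives)
    after processing the landmarks in the given order."""
    copies = {}
    for landmark in landmarks:
        for op in landmark:
            # Every copy of op created so far (including op itself) gets one new
            # discounted representative, so the number of copies doubles.
            copies[op] = 2 * copies.get(op, 1)
    return copies


counts = count_representatives([("a", "b"), ("a", "c"), ("a", "d")])
print(counts)
```

For the three landmarks of the example, the operator shared by all of them ends up with eight copies, while each of the other operators has two.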


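Definition 7 Let $\Pi = \langle V, s_0, u; O, c, b\rangle$ be an OSP task, $\mathcal{L} = \{L_1,\dots,L_n\}$ be a set of
$\varepsilon$-landmarks for $\Pi$, and $\mathit{lcost}$ be an admissible landmark cost function
from $\mathcal{L}$. For each operator $o$, let $\mathcal{L}(o)$ denote the set of all landmarks in $\mathcal{L}$ that contain
$o$. Given that, the generalized budget reducing compilation of $\Pi$ is an OSP task
$\Pi_{\mathcal{L}^*} = \langle V_{\mathcal{L}^*}, s_{0\mathcal{L}^*}, u_{\mathcal{L}^*}; O_{\mathcal{L}^*}, c_{\mathcal{L}^*}, b_{\mathcal{L}^*}\rangle$ where

$$
\begin{aligned}
b_{\mathcal{L}^*} &= b - \sum_{i=1}^{n} \mathit{lcost}(L_i),\\
V_{\mathcal{L}^*} &= V \cup \{v_{L_1},\dots,v_{L_n}\} \quad\text{with } \mathit{dom}(v_{L_i}) = \{0,1\},\\
s_{0\mathcal{L}^*} &= s_0 \cup \{\langle v_{L_1}/1\rangle,\dots,\langle v_{L_n}/1\rangle\},\\
u_{\mathcal{L}^*} &= u,\\
O_{\mathcal{L}^*} &= O \cup \{\bar{o} \mid o \in \textstyle\bigcup_{L \in \mathcal{L}} L\} \cup \{\mathit{get}(L) \mid L \in \mathcal{L}\}
\end{aligned}
\tag{17}
$$

with

$$
\begin{aligned}
\mathit{pre}(\bar{o}) &= \mathit{pre}(o) \cup \{\langle v_L/1\rangle \mid L \in \mathcal{L}(o)\},\\
\mathit{eff}(\bar{o}) &= \mathit{eff}(o) \cup \{\langle v_L/0\rangle \mid L \in \mathcal{L}(o)\},
\end{aligned}
$$

and

$$
\begin{aligned}
\mathit{pre}(\mathit{get}(L)) &= \{\langle v_L/0\rangle\},\\
\mathit{eff}(\mathit{get}(L)) &= \{\langle v_L/1\rangle\},
\end{aligned}
\tag{18}
$$

and

$$
c_{\mathcal{L}^*}(\omega) =
\begin{cases}
c(o), & \omega = o \in O,\\
c(o) - \sum_{L \in \mathcal{L}(o)} \mathit{lcost}(L), & \omega = \bar{o},\\
\mathit{lcost}(L), & \omega = \mathit{get}(L).
\end{cases}
\tag{19}
$$

Illustrating this compilation, let $\mathcal{L} = \{L_1, L_2, L_3\}$,

$$L_1 = \{a, b\},\qquad L_2 = \{b, c\},\qquad L_3 = \{a, c\},$$

with all operators having the cost of 2, and let

$$\mathit{lcost}(L_1) = \mathit{lcost}(L_2) = \mathit{lcost}(L_3) = 1.$$

In $\Pi_{\mathcal{L}^*}$, we have $V_{\mathcal{L}^*} = V \cup \{v_{L_1}, v_{L_2}, v_{L_3}\}$ and

$$O_{\mathcal{L}^*} = O \cup \{\bar{a}, \bar{b}, \bar{c}, \mathit{get}(L_1), \mathit{get}(L_2), \mathit{get}(L_3)\},$$

with, e.g.,

$$
\begin{aligned}
\mathit{pre}(\bar{a}) &= \mathit{pre}(a) \cup \{\langle v_{L_1}/1\rangle, \langle v_{L_3}/1\rangle\},\\
\mathit{eff}(\bar{a}) &= \mathit{eff}(a) \cup \{\langle v_{L_1}/0\rangle, \langle v_{L_3}/0\rangle\},\\
c_{\mathcal{L}^*}(\bar{a}) &= 0,
\end{aligned}
$$

and, for $\mathit{get}(L_1)$,

$$
\begin{aligned}
\mathit{pre}(\mathit{get}(L_1)) &= \{\langle v_{L_1}/0\rangle\},\\
\mathit{eff}(\mathit{get}(L_1)) &= \{\langle v_{L_1}/1\rangle\},\\
c_{\mathcal{L}^*}(\mathit{get}(L_1)) &= 1.
\end{aligned}
$$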

The intuition behind the compilation in Definition 7 is as follows. By Eq. 19, applying
a discounted operator $\bar{o}$ saves the total cost of all landmarks containing $o$. Therefore,

• $\bar{o}$ is allowed to be executed only at states $s$ in which all the corresponding control
propositions $\{\langle v_L/1\rangle \mid L \in \mathcal{L}(o)\}$ hold, indicating that the cost of no landmark in
$\mathcal{L}(o)$ has already been saved before reaching $s$, and

• to avoid double savings around $\mathcal{L}(o)$, applying $\bar{o}$ in $s$ turns off all these control
propositions in $s[\![\bar{o}]\!]$.

However, considering the example above, suppose that the optimal plan $\pi$ for the original
task contains an instance of operator $a$, followed by an instance of operator $b$, and no instance
of operator $c$. Applying $\bar{a}$ instead of $a$ would block us from applying $\bar{b}$ instead of $b$, and thus
the value of the optimal plan in the compilation can be lower than $Q_b(\pi)$. The rescue here
comes from the $\mathit{get}(L)$ actions that allow for selective “wasting” of the individual landmark
costs $\mathit{lcost}(L)$. In our example, while applying $\bar{a}$ in $s$ saves the cost of the landmarks $L_1$
and $L_3$, applying then $\mathit{get}(L_1)$ will waste $\mathit{lcost}(L_1)$ and safely set the control proposition
$\langle v_{L_1}/1\rangle$. In turn, this will enable applying $\bar{b}$ at the next steps, and applying $\bar{b}$ will then save
the cost of $L_2$ and “re-save” the cost of $L_1$. This way, the compilation leads to effective
equivalence between $\Pi$ and $\Pi_{\mathcal{L}^*}$, which is formulated in Theorem 17 below, and proven in
Appendix A, p. 49.
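The role of the get(L) operators can be checked on the three-landmark example with a short simulation. The following is a sketch under stated assumptions rather than the authors' implementation: the step encoding ("a" for an original operator, "a~" for its discounted copy, "get:L1" for get(L1)), the function names, and the original budget of 4 are all hypothetical, and only the control propositions are tracked (the preconditions and effects of the original task are ignored).

```python
LANDMARKS = {"L1": {"a", "b"}, "L2": {"b", "c"}, "L3": {"a", "c"}}
LCOST = {"L1": 1, "L2": 1, "L3": 1}
COST = {"a": 2, "b": 2, "c": 2}


def landmarks_of(op):
    """L(o): names of all landmarks that contain operator op."""
    return [name for name, ops in LANDMARKS.items() if op in ops]


def compiled_plan_cost(plan):
    """Cost of a plan in the compiled task, or None if a step violates the control propositions."""
    control = {name: 1 for name in LANDMARKS}  # initially every v_L = 1
    total = 0
    for step in plan:
        if step.startswith("get:"):
            name = step[len("get:"):]
            if control[name] != 0:  # get(L) requires v_L = 0
                return None
            control[name] = 1
            total += LCOST[name]
        elif step.endswith("~"):
            op = step[:-1]
            names = landmarks_of(op)
            if any(control[name] != 1 for name in names):
                return None  # some landmark cost was already saved
            for name in names:
                control[name] = 0
            total += COST[op] - sum(LCOST[name] for name in names)
        else:
            total += COST[step]  # original operator, no control propositions
    return total


reduced_budget = 4 - sum(LCOST.values())  # hypothetical original budget b = 4
for plan in (["a~", "b~"], ["a~", "b"], ["a~", "get:L1", "b~"]):
    print(plan, compiled_plan_cost(plan), "reduced budget:", reduced_budget)
```

Under these assumptions the original plan of $a$ followed by $b$ costs 4 and exactly fits the original budget. In the compiled task, applying the two discounted copies back to back is not possible, mixing a discounted copy with an original operator exceeds the reduced budget, and only the variant that inserts get(L1) stays within it.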

Theorem 17 Let $\Pi = \langle V, s_0, u; O, c, b\rangle$ be an OSP task, $\mathcal{L} = \{L_1,\dots,L_n\}$ be a set of $\varepsilon$-landmarks
for $\Pi$, $\mathit{lcost}$ be an admissible landmark cost function from $\mathcal{L}$, and $\Pi_{\mathcal{L}^*}$ be the
(generalized) budget reducing compilation of $\Pi$. For every $\pi$ for $\Pi$ with $Q_b(\pi) > 0$, there is
a plan $\pi_{\mathcal{L}^*}$ for $\Pi_{\mathcal{L}^*}$ with $Q_{b_{\mathcal{L}^*}}(\pi_{\mathcal{L}^*}) = Q_b(\pi)$, and vice versa.

5.4 $\varepsilon$-Landmarks & Incremental BFBB

As we discussed earlier, if the value of the initial state is not zero, then the empty plan
has some positive value, and thus the $\varepsilon$-compilation $\Pi_{\varepsilon}$ of $\Pi$ as in Definition 5 will have no
landmarks with positive cost. In passing we noted that this small problem can be remedied
by considering as “valuable” only facts $\langle v/d\rangle$ such that both $u_v(d) > 0$ and $\langle v/d\rangle \notin s_0$. We
now consider this aspect of OSP more closely, and show how $\varepsilon$-landmarks discovery and
incremental revelation of plans by BFBB can be combined in a mutually stratifying way.

Let $\Pi = \langle V, s_0, u; O, c, b\rangle$ be the OSP task of our interest, and suppose we are given a
set of plans $\pi_1,\dots,\pi_n$ for $\Pi$. If so, then we are no longer interested in searching for plans
that “achieve something,” but in searching for plans that achieve something beyond what
$\pi_1,\dots,\pi_n$ already achieve. Specifically, let $s_i = s_0[\![\pi_i]\!]$ be the end-state of $\pi_i$, and for any
set of propositions $s \subseteq P$, let $\mathit{goods}(s) \subseteq s$ be the set of all propositions $\langle v/d\rangle \in s$ such that
$u_v(d) > 0$. If a new plan $\pi$ with end-state $s$ achieves something beyond what $\pi_1,\dots,\pi_n$
already achieve, then, for all $1 \le i \le n$,

$$\mathit{goods}(s) \setminus \mathit{goods}(s_i) \neq \emptyset.$$

We  now  put  this  observation  to  work. 
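As a concrete reading of this condition, the following minimal sketch checks whether a state achieves something beyond a set of reference states. The encoding is an assumption made for illustration only: states are sets of (variable, value) pairs, the utility is a dictionary over such pairs, and the function names are hypothetical.

```python
def goods(state, utility):
    """Value-carrying propositions of a state; utility maps (variable, value) to a number."""
    return {prop for prop in state if utility.get(prop, 0) > 0}


def achieves_something_new(state, reference_states, utility):
    """True iff goods(state) minus goods(s_i) is non-empty for every reference state s_i."""
    new_goods = goods(state, utility)
    return all(new_goods - goods(ref, utility) for ref in reference_states)


utility = {("x", 1): 5, ("y", 1): 3}
s1 = {("x", 1), ("y", 0)}
s2 = {("x", 0), ("y", 1)}
print(achieves_something_new({("x", 1), ("y", 1)}, [s1, s2], utility))
print(achieves_something_new({("x", 1), ("y", 0)}, [s1, s2], utility))
```

The first candidate carries a valuable proposition missing from each reference state, while the second one adds nothing beyond the first reference state.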

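Definition 8 Given an OSP task $\Pi = \langle V, s_0, u; O, c, b\rangle$ and a set of reference states $S_{ref} =
\{s_1,\dots,s_n\}$ of $\Pi$, the $(\varepsilon, S_{ref})$-compilation of $\Pi$ is a classical planning task
$\Pi_{(\varepsilon, S_{ref})} = \langle V_{\varepsilon}, s_{0\varepsilon}, G_{\varepsilon}; O_{\varepsilon}, c_{\varepsilon}\rangle$ with

$$
\begin{aligned}
V_{\varepsilon} &= V \cup \{x_1,\dots,x_n, \mathit{search}, \mathit{collect}\},\\
&\quad\text{with } \mathit{dom}(x_i) = \mathit{dom}(\mathit{search}) = \mathit{dom}(\mathit{collect}) = \{0,1\},\\
s_{0\varepsilon} &= s_0 \cup \{\langle \mathit{search}/1\rangle, \langle \mathit{collect}/0\rangle, \langle x_1/0\rangle,\dots,\langle x_n/0\rangle\},\\
G_{\varepsilon} &= \{\langle x_1/1\rangle,\dots,\langle x_n/1\rangle\},\\
O_{\varepsilon} &= \bar{O} \cup \bigcup_{i=1}^{n} O_i \cup \{\mathit{finish}\},
\end{aligned}
$$

where

• $\bar{O} = \{\bar{o} \mid o \in O\}$,

$$\mathit{pre}(\bar{o}) = \mathit{pre}(o) \cup \{\langle \mathit{search}/1\rangle\},\qquad \mathit{eff}(\bar{o}) = \mathit{eff}(o),\qquad c_{\varepsilon}(\bar{o}) = c(o).$$

• $O_i = \{o_{i,g} \mid s_i \in S_{ref},\ g \in \mathit{goods}(P) \setminus s_i\}$,

$$\mathit{pre}(o_{i,g}) = \{g, \langle \mathit{collect}/1\rangle\},\qquad \mathit{eff}(o_{i,g}) = \{\langle x_i/1\rangle\},\qquad c_{\varepsilon}(o_{i,g}) = 0.$$

• $\mathit{pre}(\mathit{finish}) = \emptyset$, $\mathit{eff}(\mathit{finish}) = \{\langle \mathit{collect}/1\rangle, \langle \mathit{search}/0\rangle\}$, $c_{\varepsilon}(\mathit{finish}) = 0$.

Note that

• the goal $G_{\varepsilon}$ cannot be achieved without applying the finish operator,

• the operators $\bar{o}$ can be applied only before finish,

• the subgoal achieving operators $o_{i,g}$ can be applied only after finish.

This way, the first part of any plan for $\Pi_{(\varepsilon, S_{ref})}$ determines a plan for $\Pi$, and the second part
“verifies” that the end-state of that plan achieves a subset of value-carrying propositions
$\mathit{goods}(P)$ that is included in no state from $S_{ref}$.$^{9}$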

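Theorem 18 Let $\Pi = \langle V, s_0, u; O, c, b\rangle$ be an OSP task, $S_{ref} = \{s_1,\dots,s_n\}$ be a subset of
$\Pi$'s states, and $L$ be a landmark for $\Pi_{(\varepsilon, S_{ref})}$ such that $L \subseteq \bar{O}$. For any plan $\pi$ for $\Pi$ such
that $\mathit{goods}(s_0[\![\pi]\!]) \setminus \mathit{goods}(s_i) \neq \emptyset$ for all $s_i \in S_{ref}$, $\pi$ contains an instance of at least one
operator from $L' = \{o \mid \bar{o} \in L\}$.

Proof: Assume to the contrary that there exists a plan $\pi = \langle o_1,\dots,o_k\rangle$ for $\Pi$ such that
$\mathit{goods}(s_0[\![\pi]\!]) \setminus \mathit{goods}(s_i) \neq \emptyset$ for all $s_i \in S_{ref}$, and yet $\pi \cap L' = \emptyset$. Given that, let $\{g_1,\dots,g_n\}$
be an arbitrary set of propositions from $\mathit{goods}(s_0[\![\pi]\!]) \setminus \mathit{goods}(s_1),\dots,\mathit{goods}(s_0[\![\pi]\!]) \setminus \mathit{goods}(s_n)$,
respectively. By the construction of $\Pi_{(\varepsilon, S_{ref})}$, it is immediate that

$$\pi_{(\varepsilon, S_{ref})} = \langle \bar{o}_1,\dots,\bar{o}_k, \mathit{finish}, o_{1,g_1},\dots,o_{n,g_n}\rangle$$

is a plan for $\Pi_{(\varepsilon, S_{ref})}$ and, by our assumption about $\pi$ and $L'$, it holds that $\pi_{(\varepsilon, S_{ref})} \cap L = \emptyset$.
This, however, contradicts that $L$ is a landmark for $\Pi_{(\varepsilon, S_{ref})}$. □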

Theorem 18 allows us to define an iterative version of BFBB, inc-compile-and-BFBB,
depicted in Figure 12. The successive iterations of inc-compile-and-BFBB correspond to
running the regular BFBB on successively more informed $(\varepsilon, S_{ref})$-compilations of $\Pi$, with
the states discovered at iteration $i$ making the $(\varepsilon, S_{ref})$-compilation used at iteration $i + 1$
more informed.

inc-compile-and-BFBB maintains as a pair of global variables a set of reference states
$S_{ref}$ and the best solution so far $n^*$. At each iteration of the loop, inc-BFBB, a modified

9. This “solve & verify” planning technique appears to be helpful in many planning formalism compilations;
see, e.g., (Keyder & Geffner, 2009).

inc-compile-and-BFBB (Π = ⟨V, s0, u; O, c, b⟩)
    initialize global variables:
        n* := s0            // best solution so far
        S_ref := {s0}       // current reference states
    loop:
        Π_(ε,S_ref) := (ε, S_ref)-compilation of Π
        L := a set of landmarks for Π_(ε,S_ref)
        lcost := admissible landmark cost function from L
        Π_L* := budget reducing compilation of (L, lcost) into Π
        if inc-BFBB(Π_L*, S_ref, n*) = done:
            return plan for Π associated with n*

inc-BFBB (Π, S_ref, n*)
    open := new max-heap ordered by f(n) = h(s[n], b − g(n))
    open.insert(make-root-node(s0))
    closed := ∅; best-cost := 0
    while not open.empty()
        n := open.pop-max()
        if goods(s[n]) ⊈ goods(s') for all s' ∈ S_ref:
            S_ref := S_ref ∪ {s[n]}
        if termination criterion: return updated
        if f(n) ≤ u(s[n*]): break
        // the rest is similar to BFBB in Figure 2
        if s[n] ∉ closed or g(n) < best-cost(s[n]):
            closed := closed ∪ {s[n]}
            best-cost(s[n]) := g(n)
            foreach o ∈ O(s[n]):
                n' := make-node(s[n]⟦o⟧)
                if g(n') > b or f(n') ≤ u(s[n*]): continue
                if u(s[n']) > u(s[n*]): update n* := n'
                open.insert(n')
    return done

Figure 12: Iterative BFBB with landmark enhancement


version of BFBB, is called with an $(\varepsilon, S_{ref})$-compilation of $\Pi$, created on the basis of the
current pair of $S_{ref}$ and $n^*$. The reference set $S_{ref}$ is then extended by inc-BFBB with all
the non-redundant value-carrying states discovered during the search, and $n^*$ is updated if
the search discovers nodes of higher value.

If and when the OPEN list becomes empty or the node $n$ selected from the list promises
less than the lower bound, inc-BFBB returns an indicator, done, that the best solution
$n^*$ found so far, across the iterations of inc-compile-and-BFBB, is optimal. In that case,
inc-compile-and-BFBB leaves its loop and extracts that optimal plan from $n^*$. However,
inc-BFBB may also terminate in a different way, if a certain complementary termination
criterion is satisfied. The latter criterion comes to assess whether the updates to $S_{ref}$
performed in the current session of BFBB warrant updating the $(\varepsilon, S_{ref})$-compilation and
restarting the search. If terminated this way, inc-BFBB returns a respective indicator, and
inc-compile-and-BFBB goes into another iteration of its loop, with the updated $S_{ref}$ and
$n^*$. We note that, while the optimality of the algorithm holds for any such termination
condition, the latter should greatly affect the runtime efficiency of the algorithm.
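The control flow of the outer loop in Figure 12 can be summarized in a few lines. This is only a skeleton under stated assumptions: `compile_task` and `inc_bfbb` are hypothetical callables standing in for the compilation steps and for the modified BFBB, the status strings are an illustrative encoding of the two indicators, and the stubs at the end exist solely to exercise the loop.

```python
def inc_compile_and_bfbb(task, initial_state, compile_task, inc_bfbb):
    """Outer loop: recompile with the current reference states until the search reports 'done'."""
    reference_states = [initial_state]   # S_ref, shared with the search
    best = {"node": initial_state}       # n*, shared with the search
    iterations = 0
    while True:
        iterations += 1
        compiled = compile_task(task, reference_states)
        status = inc_bfbb(compiled, reference_states, best)
        if status == "done":
            return best["node"], reference_states, iterations
        # Any other status means S_ref was updated and the search should restart.


def stub_compile(task, reference_states):
    # Placeholder for compilation, landmark extraction, and budget reduction.
    return (task, len(reference_states))


def stub_search(compiled, reference_states, best):
    # Placeholder search: pretends to discover one new reference state per call, twice.
    if len(reference_states) < 3:
        new_state = "s%d" % len(reference_states)
        reference_states.append(new_state)
        best["node"] = new_state
        return "updated"
    return "done"


print(inc_compile_and_bfbb("task", "s0", stub_compile, stub_search))
```

Keeping the reference set and the best node in mutable containers mirrors their role as global variables in the figure: both survive across restarts of the search.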

5.5  Empirical  Evaluation 

To evaluate the merits of the landmark-based budget reducing compilation, we have extended
our Fast Downward based prototype OSP solver described in Section 3 with the
following components:

• $(\varepsilon, S_{ref})$-compilation of OSP tasks $\Pi$ for arbitrary sets of reference states $S_{ref}$;

• Generation of disjunctive action landmarks for $(\varepsilon, S_{ref})$-compilations using the LM-Cut
procedure (Helmert & Domshlak, 2009) of Fast Downward; and

• The incremental BFBB procedure inc-compile-and-BFBB as in Figure 12, with the
search termination criterion being satisfied (only) if the examined node $n$ improves
over the current value lower bound.

After some preliminary evaluation, we also added two optimality preserving enhancements
to the search. First, the auxiliary variables of our compilations increase the dimensionality
of the problem, and this is well known to negatively affect the quality of the
abstraction heuristics (Domshlak et al., 2012). Hence, we devised the projections with respect
to the original OSP problem $\Pi$, and the open list was ordered as if the search is done
on the original problem, that is, by

$$h\Bigl(s[n]^{\downarrow\Pi},\; b - g(n) + \sum_{v_L \in s[n]} \mathit{lcost}(L)\Bigr),$$

where $s^{\downarrow\Pi}$ is the projection of the $\Pi_{\mathcal{L}^*}$'s state $s$ on the variables of the original OSP task $\Pi$.
This change in heuristic evaluation is sound, as Theorem 17 in particular implies that any
admissible heuristic for $\Pi$ is also an admissible heuristic for $\Pi_{\mathcal{L}^*}$, and vice versa. Second,
when a new node $n$ is generated, we check whether

$$g(n) + \sum_{L:\,\langle v_L/0\rangle \in s[n]} \mathit{lcost}(L) \;\ge\; g(n') + \sum_{L:\,\langle v_L/0\rangle \in s[n']} \mathit{lcost}(L),$$

for some previously generated node $n'$ that corresponds to the same state of the original
problem $\Pi$, that is, $s[n']^{\downarrow\Pi} = s[n]^{\downarrow\Pi}$. If so, then $n$ is pruned right away. Optimality
preservation of this enhancement is established in Lemma 19 and proven in Appendix A,
p. 50.
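The second enhancement amounts to a dominance table keyed by the projected state. The sketch below is a minimal illustration under assumptions: the class and method names are hypothetical, projected states are assumed to be hashable, and the set of landmarks whose control proposition is off is passed in explicitly. A node is pruned when its cost plus the landmark costs currently saved is not smaller than that of an earlier node with the same projection.

```python
class DominanceTable:
    """Remembers, per projected state, the smallest g(n) + sum of lcost(L) over v_L = 0."""

    def __init__(self, lcost):
        self.lcost = lcost   # landmark name -> landmark cost
        self.best = {}       # projected state -> smallest adjusted cost seen so far

    def adjusted_cost(self, g, off_landmarks):
        return g + sum(self.lcost[name] for name in off_landmarks)

    def should_prune(self, projected_state, g, off_landmarks):
        value = self.adjusted_cost(g, off_landmarks)
        known = self.best.get(projected_state)
        if known is not None and value >= known:
            return True      # an earlier node with the same projection dominates
        self.best[projected_state] = value
        return False


table = DominanceTable({"L1": 1, "L2": 1, "L3": 1})
state = frozenset({("at", "depot")})
print(table.should_prune(state, 2, ["L1"]))         # first visit
print(table.should_prune(state, 1, ["L1", "L3"]))   # same adjusted cost
print(table.should_prune(state, 0, ["L1"]))         # strictly better
```

Storing a single number per projected state suffices here because the test compares only the adjusted costs, not the individual control propositions.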

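Figure 13: Comparative view of empirical results in terms of expanded nodes, for BFBB
vs. compile-and-BFBB, with (a) blind and (b) abstraction $h_{\mathcal{M}}$ heuristics

Lemma 19 Let $\Pi$ be an OSP task, $\Pi_{(\varepsilon, S_{ref})}$ be a $(\varepsilon, S_{ref})$-compilation of $\Pi$, $\mathcal{L}$ be a set of
landmarks for $\Pi_{(\varepsilon, S_{ref})}$, $\mathit{lcost}$ be an admissible landmark cost function for $\mathcal{L}$, and $\Pi_{\mathcal{L}^*}$ be the
respective budget reducing compilation of $(\mathcal{L}, \mathit{lcost})$ into $\Pi$. Let $\pi_1$ and $\pi_2$ be a pair of plans
for $\Pi_{\mathcal{L}^*}$ with end-states $s_1$ and $s_2$, respectively, such that $s_1^{\downarrow\Pi} = s_2^{\downarrow\Pi}$ and

$$c_{\mathcal{L}^*}(\pi_1) + \sum_{L:\,\langle v_L/0\rangle \in s_1} \mathit{lcost}(L) \;\ge\; c_{\mathcal{L}^*}(\pi_2) + \sum_{L:\,\langle v_L/0\rangle \in s_2} \mathit{lcost}(L). \tag{20}$$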


Q 


Then, for any plan $\pi'_1$ that extends $\pi_1$, there exists a plan $\pi'_2$ that extends $\pi_2$ such that

Figure 14: Comparative view of empirical results in terms of expanded nodes, for BFBB
vs. inc-compile-and-BFBB, with (a) blind and (b) abstraction $h_{\mathcal{M}}$ heuristics

Our evaluation included the regular BFBB planning for $\Pi$, solving $\Pi$ using landmark-based
compilation via compile-and-BFBB, and the straightforward setting of inc-compile-and-BFBB
described above. All three approaches were evaluated under the blind heuristic and the additive
abstraction heuristic $h_{\mathcal{M}}$ described in Section 3. Figure 7 depicts the results of our
evaluation in terms of expanded nodes. Similarly to the experiment reported in Section 3,
each task was approached under four different budgets, corresponding to 25%, 50%, 75%,
and 100% of the minimal cost needed to achieve all the goals in the task, and each run
was restricted to 10 minutes. Figures 13a and 13b compare the performance of BFBB and
compile-and-BFBB in terms of expanded nodes across the four levels of cost budget, under
blind (a) and abstraction $h_{\mathcal{M}}$ (b) heuristics. Figures 14a and 14b provide a similar
comparison between BFBB and inc-compile-and-BFBB.$^{10}$ Figures 19-22 and Figures 23-26
in Appendix B provide a more detailed view on the results in Figure 13 and Figure 14,
respectively, by breaking them along different levels of cost budget.

As  Figure  7  shows,  the  results  were  very  satisfactory.  With  no  informative  heuristic 
guidance  at  all,  the  number  of  nodes  expanded  by  compile-and-BFBB  was  typically  much 
lower  than  the  number  of  nodes  expanded  by  BFBB,  with  the  difference  reaching  three 
orders  of  magnitude  more  than  once.  Of  the  760  task/budget  pairs  behind  Figure  7a,  81 
pairs  were  solved  by  compile-and-BFBB  with  no  search  at  all  (by  proving  that  no  plan  can 
achieve  value  higher  than  that  of  the  initial  state),  while,  unsurprisingly,  only  4  of  these 
tasks  were  solved  with  no  search  by  BFBB. 

As  expected,  the  value  of  landmark-based  budget  reduction  is  lower  when  the  search  is 
equipped  with  a  meaningful  heuristic  (Figure  13b).  Yet,  even  with  our  abstraction  heuris¬ 
tic  in  hand,  the  number  of  nodes  expanded  by  compile-and-BFBB  was  often  substantially 
lower  than  the  number  of  nodes  expanded  by  BFBB.  Here,  BFBB  and  compile-and-BFBB 
solved  with  no  search  39  and  85  task/budget  pairs,  respectively.  Finally,  despite  the 
rather  ad  hoc  setting  of  our  incremental  inc-compile-and-BFBB  procedure,  switching  from 
compile-and-BFBB  to  inc-compile-and-BFBB  was  typically  beneficial.  Obviously,  much  deeper 
investigation  and  development  of  inc-compile-and-BFBB  is  still  required,  especially  around 
the  flexibility  with  respect  to  the  iteration  termination  criterion. 

6.  Summary  and  Future  Work 

Deterministic oversubscription planning captures the computational core of one of the most
practically  important  setups  of  automated  action  selection,  and  yet,  despite  the  apparent 
interest  in  this  problem,  it  has  not  been  sufficiently  investigated.  In  this  work,  we  stepped 
towards  translating  the  spectacular  advances  in  classical  deterministic  planning  to  deter¬ 
ministic  OSP.  Tracing  the  key  sources  of  progress  in  classical  planning,  we  identified  a  severe 
lack  of  effective  approximations  for  OSP,  and  worked  towards  bridging  this  gap. 

Our focus was on two classes of approximation techniques that underlie most state-of-the-art
optimal heuristic-search solvers for classical planning, namely state-space abstractions
and goal-reachability landmarks. First, we defined the notion of additive abstractions
for OSP, studied the complexity of deriving effective abstractions from a rich space of
hypotheses, and revealed some substantial, empirically relevant islands of tractability of this
abstraction discovery problem. Next, we showed how standard goal-reachability landmarks
of  certain  classical  planning  tasks  can  be  compiled  into  the  OSP  task  of  interest,  result¬ 
ing  in  an  equivalent  OSP  task  with  a  lower  cost  allowance,  and  thus  with  a,  sometimes 
dramatically,  smaller  search  space. 

All the techniques proposed here satisfy the properties required by the efficient search
algorithms for optimal OSP. However, we believe that these techniques, and especially
landmark-based budget reducing compilations, should be as beneficial in satisficing OSP as
in optimal OSP, and this especially because the difference between optimal and satisficing
planning appears to be much smaller in OSP than in classical deterministic planning.

10. We do not compare here between the running times, but the per-node CPU time overhead due to
landmark-based budget reduction was ≤ 10%.

Many  interesting  questions  remain  open  for  future  work,  and  in  general,  the  prospects 
for  further  developments  in  oversubscription  planning  appear  now  quite  promising.  Within 
the  specific  context  of  our  work,  the  two  most  interesting  questions  for  us  are  optimization  of 
value partitions given cost partition, that is, optimizing abstraction discovery in $\mathcal{A}_p(c, \cdot, \cdot)$,
and  a  thorough  investigation  of  the  interleaved  landmark  discovery  and  search  for  OSP  that 
was  introduced  in  Section  5.4.  Among  the  prospective  research  directions  on  a  broader 
perspective  we  would  like  to  emphasize  the  following  candidates  for  investigation. 

•  Following  the  work  of  Katz  and  Domshlak  (2010a)  on  implicit  abstractions  for  classical 
planning,  it  is  only  natural  to  investigate  the  computational  merits  of  implicit  abstrac¬ 
tions  for  OSP.  In  particular,  this  thread  of  investigation  will  unavoidably  push  towards 
a  better  understanding  of  the  computational  tractability  boundaries  of  deterministic 
OSP. 

•  The  basic  model  of  deterministic  planning  in  Section  2.1  was  used  to  provide  a  unifying 
comparative  view  on  the  basic  models  of  classical,  cost-bounded,  net-benefit,  and 
oversubscription  planning,  but  not  beyond  that.  One  practically  motivated  extension 
of  this  model  is  to  lift  action  costs  to  vectors  of  action  costs.  Such  a  variant  of 
cost-bounded planning has already been investigated (Nakhost et al., 2012), and it is only
natural  to  examine  this  extension  in  the  context  of  OSP.  Note  that,  at  the  level 
of  the  model,  adding  cost  measures  shifts  problem  solving  from  shortest  path(s)  to 
restricted  shortest  path(s)  problems.  While  the  latter  is  already  NP-hard  (Handler  & 
Zang,  1980),  similarly  to  the  Knapsack  problem,  it  can  be  solved  in  pseudo-polynomial 
time  (Desrochers  &  Soumis,  1988). 
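To make the pseudo-polynomial claim concrete, here is a minimal dynamic-programming sketch for a restricted shortest path problem with a single integer resource. It is an illustration only and not one of the cited algorithms: the function name and the edge encoding are hypothetical, and every edge is assumed to consume at least one unit of the resource, so that layers of the table can be filled in increasing order of consumption.

```python
import math


def restricted_shortest_path(num_nodes, edges, source, target, resource_limit):
    """Cheapest source-target path whose total resource use is at most resource_limit.

    edges: list of (u, v, cost, resource) with integer resource >= 1.
    Runs in O(resource_limit * len(edges)) time, i.e. pseudo-polynomial time.
    """
    # dist[r][v] = cheapest cost of reaching v while using exactly r resource units
    dist = [[math.inf] * num_nodes for _ in range(resource_limit + 1)]
    dist[0][source] = 0
    for used in range(resource_limit + 1):
        for u, v, cost, resource in edges:
            if dist[used][u] == math.inf:
                continue
            total = used + resource
            if total <= resource_limit and dist[used][u] + cost < dist[total][v]:
                dist[total][v] = dist[used][u] + cost
    best = min(dist[used][target] for used in range(resource_limit + 1))
    return None if best == math.inf else best


# Node 0 is the source, node 2 the target; the direct edge is cheap but resource-hungry.
example_edges = [(0, 2, 1, 5), (0, 1, 2, 1), (1, 2, 2, 1)]
for limit in (1, 2, 5):
    print(limit, restricted_shortest_path(3, example_edges, 0, 2, limit))
```

The running time depends on the numeric value of the resource limit rather than on the length of its encoding, which is what makes the procedure pseudo-polynomial.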
Appendix A. Proofs

Theorem 2 Given an OSP task $\Pi = \langle V, s_0, u; O, c, b\rangle$ and a homomorphic abstraction
skeleton $AS = \{(G_1, \alpha_1),\dots,(G_k, \alpha_k)\}$ of $\mathcal{M}_\Pi$,

(1) for each cost partition $c \in \mathcal{C}_p$, there exists a budget partition $b^* \in \mathcal{B}_p$ such that
$(c, u, b^*)$ induces an additive abstraction for $\Pi$ for all value partitions $u \in \mathcal{U}_p$.

(2) for each budget partition $b \in \mathcal{B}_p$, there exists a cost partition $c^* \in \mathcal{C}_p$ such that
$(c^*, u, b)$ induces an additive abstraction for $\Pi$ for all value partitions $u \in \mathcal{U}_p$.

Proof: Let $\pi = \langle (s_0, o_1, s_1), (s_1, o_2, s_2),\dots,(s_{n-1}, o_n, s_n)\rangle$ be an optimal $s_0$-plan for $\mathcal{M}_\Pi$,
and, for $i \in [k]$, let $\pi_i = \langle (\alpha_i(s_0), o_1, \alpha_i(s_1)),\dots,(\alpha_i(s_{n-1}), o_n, \alpha_i(s_n))\rangle$ be the mapping of
$\pi$ to $G_i$. Since $AS$ is homomorphic, the paths $\pi_1,\dots,\pi_k$ are well-defined.
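(1) Given a cost partition $c \in \mathcal{C}_p$, let budget profile $b^* \in \mathcal{B}$ be defined as $b^*[i] =
\sum_{j \in [n]} c[i](o_j)$, for $i \in [k]$. First, note that $b^* \in \mathcal{B}_p$ since

$$\sum_{i \in [k]} b^*[i] \;=\; \sum_{i \in [k]} \sum_{j \in [n]} c[i](o_j) \;\overset{(\dagger)}{\le}\; \sum_{j \in [n]} c(o_j) \;\overset{(\ddagger)}{\le}\; b,$$

where $(\dagger)$ is by $c$ being a cost partition, and $(\ddagger)$ is by $\pi$ being an $s_0$-plan for $\mathcal{M}_\Pi$.
Second, for any $u \in \mathcal{U}$, by the construction of $b^*$, $\pi_i$ is an $\alpha_i(s_0)$-plan for the abstract
model $\mathcal{M}_i^{(c,u,b^*)}$. Now, let $u \in \mathcal{U}_p$, and for $i \in [k]$, let $\pi_i^*$ be an optimal $\alpha_i(s_0)$-plan for
$\mathcal{M}_i^{(c,u,b^*)}$. We have

$$\sum_{i \in [k]} Q_{b^*[i]}(\pi_i^*) \;\overset{(\dagger)}{\ge}\; \sum_{i \in [k]} Q_{b^*[i]}(\pi_i) \;\overset{(\ddagger)}{\ge}\; Q_b(\pi), \tag{21}$$

where $(\dagger)$ is by optimality of $\pi_i^*$, and $(\ddagger)$ is by $\alpha_i(s_n)$ being the end-state of $\pi_i$ and $u$
being a value partition. Therefore, $(c, u, b^*)$ induces an additive abstraction for $\Pi$, that
is, $\mathcal{M}_{(c,u,b^*)}$ is an additive abstraction of $\mathcal{M}_\Pi$ with respect to $AS$.

(2) Given a budget partition $b$, let cost profile $c^* \in \mathcal{C}$ be defined as $c^*[i](o) = c(o) \cdot \frac{b[i]}{b}$
for all operators $o \in O$, and all $i \in [k]$. First, we have $c^* \in \mathcal{C}_p$ since $b \in \mathcal{B}_p$ implies
$\frac{1}{b}\sum_{i \in [k]} b[i] \in [0, 1]$. Second, for any $u \in \mathcal{U}$, by the construction of $c^*$, $\pi_i$ is an $\alpha_i(s_0)$-plan
for $\mathcal{M}_i^{(c^*,u,b)}$. Following now exactly the same line of reasoning as the one around
Eq. 21 above accomplishes the proof that $\mathcal{M}_{(c^*,u,b)}$ is an additive abstraction of $\mathcal{M}_\Pi$ with
respect to $AS$ for any $u \in \mathcal{U}_p$.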
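The two constructions in this proof can be sanity-checked numerically. The sketch below rests on explicit assumptions: the scaling $c^*[i](o) = c(o)\cdot b[i]/b$ is a reading of the garbled formula in part (2), the function names are hypothetical, and the operators, costs, budget, and partitions are made up for illustration. It derives a budget partition from a cost partition along a fixed plan, and a cost partition from a budget partition, and checks the inequalities the proof relies on.

```python
from fractions import Fraction


def budget_from_cost_partition(cost_partition, plan):
    """Part (1): b*[i] is the cost of the plan under the i-th component of the cost partition."""
    return [sum(costs[op] for op in plan) for costs in cost_partition]


def cost_from_budget_partition(cost, budget_partition, budget):
    """Part (2), assumed form: c*[i](o) = c(o) * b[i] / b."""
    return [
        {op: Fraction(c) * share / budget for op, c in cost.items()}
        for share in budget_partition
    ]


cost = {"o1": 3, "o2": 2}
plan = ["o1", "o2"]            # costs 5, within the budget of 6
budget = 6
cost_partition = [{"o1": 1, "o2": 1}, {"o1": 2, "o2": 1}]
budget_partition = [2, 4]

b_star = budget_from_cost_partition(cost_partition, plan)
c_star = cost_from_budget_partition(cost, budget_partition, budget)

print(b_star, sum(b_star) <= budget)
for share, costs in zip(budget_partition, c_star):
    print(sum(costs[op] for op in plan) <= share)
```

In both directions the plan stays affordable in every abstract component, which is the property the proof needs before it compares values in Eq. 21.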
Lemma 6 The algorithm in Figure 8a computes $\kappa(u)$.

Proof: Due to the boundedness and non-emptiness of the polytope induced by $\mathcal{L}_1(m)$, the
termination of the algorithm is straightforward. Thus, given a strong 0-binary partition $u$,
the only question is whether the value with which the algorithm terminates is $\kappa(u)$. First,
let us show that:

($\dagger$) For $m \in [k]$, if $x$ is a solution of $\mathcal{L}_1(m)$, then $x[\xi] \le b$ if and only if, for each cost
partition $c \in \mathcal{C}_p$, there exists a budget partition $b \in \mathcal{B}_p$ such that $(c, u, b)$ is an
abstraction for $\mathcal{M}_\Pi$ and $h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge m\sigma$.

($\Leftarrow$) Assume to the contrary that, for each cost partition $c \in \mathcal{C}_p$, there exists a budget
partition $b \in \mathcal{B}_p$ with $h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge m\sigma$, and yet $x[\xi] > b$. Given the values provided
by $x$ to the cost variables $\bigcup_{o \in O} \{c[i](o)\}$, let $c$ be the corresponding cost partition,
and $\delta_1,\dots,\delta_k$ be the induced lengths of the shortest paths from $\alpha_1(s_0),\dots,\alpha_k(s_0)$ to $\sigma$-valued
states in $G_1,\dots,G_k$, respectively. By our assumption, let $b$ be a budget partition
such that $h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge m\sigma$. First, by the definition of strong 0-binary value partitions,
$h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge m\sigma$ implies that there exists $Z \subseteq [k]$, $|Z| = m$, such that, for $i \in Z$, $b[i] \ge \delta_i$.
Second, constraint (10c), maximization of $\xi$, and the fact that the only bound on each $b[i]$
is by $\delta_i$ imply together that, for $i \in Z$, $x[b[i]] = \delta_i$. Putting things together, we obtain

$$b \;\overset{b \in \mathcal{B}_p}{\ge}\; \sum_{i \in Z} b[i] \;\ge\; \sum_{i \in Z} \delta_i \;=\; \sum_{i \in Z} x[b[i]] \;\overset{(10c)}{\ge}\; x[\xi],$$

contradicting our assumption.

($\Rightarrow$) Assume to the contrary that $x[\xi] \le b$, and yet there exists a cost partition $c \in \mathcal{C}_p$
such that, for all budget partitions $b \in \mathcal{B}_p$ with $(c, u, b) \in \mathcal{A}_p$, we have $h_{\mathcal{M}_{(c,u,b)}}(s_0) < m\sigma$.

Let the shortest path lengths $\delta_1,\dots,\delta_k$ be defined as above, but now with respect to
the specific cost partition $c$ from the assumption. Likewise, let $x_c$ be a solution to $\mathcal{L}_1(m)$
with an extra constraint on the cost variables $\bigcup_{o \in O} \{c[i](o)\}$ to be assigned to $c$. Since the
objective in $\mathcal{L}_1(m)$ is to maximize the value of $\xi$, we have

$$x[\xi] \ge x_c[\xi]. \tag{22}$$

Now, let

$$Z = \operatorname*{argmax}_{Z' \subseteq [k],\ |Z'| = m}\ \sum_{i \in Z'} \delta_i.$$

Together, constraint (10c), maximization of $\xi$, and the fact that the only bound on each
$b[i]$ is by $\delta_i$ (via the cost variables) imply that

$$x_c[\xi] = \sum_{i \in Z} x_c[b[i]] = \sum_{i \in Z} \delta_i. \tag{23}$$

In turn, together with $x[\xi] \le b$ and Eq. 22, Eq. 23 implies that

$$b[i] = \begin{cases} x_c[b[i]], & i \in Z,\\ 0, & \text{otherwise}, \end{cases}$$

is a budget partition with $(c, u, b) \in \mathcal{A}_p$, and $h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge m\sigma$, contradicting our assumption.

Having proved the sub-claim ($\dagger$), which basically captures the semantics of $\mathcal{L}_1(m)$, suppose
that the algorithm terminates within the loop, and returns $m\sigma$ for some $m > 0$. By
the construction of the algorithm, if $x$ is a solution of $\mathcal{L}_1(m)$, then $x[\xi] \le b$. By ($\dagger$), for each
cost partition $c \in \mathcal{C}_p$, there exists $(c, u, b) \in \mathcal{A}_p$ such that $h_{(c,u,b)}(s) \ge m\sigma$. If $m = k$, then
trivially $\kappa(u) = m\sigma$. Otherwise, if $m < k$, we know that the algorithm did not terminate
at the previous iteration corresponding to $m + 1$. Again, ($\dagger$) then implies that there exists a
cost partition $c \in \mathcal{C}_p$ for which no $(c, u, b) \in \mathcal{A}_p$ will induce $h_{(c,u,b)}(s) \ge (m + 1)\sigma$. Hence,
by the definition of $\kappa(u)$, $\kappa(u) < (m + 1)\sigma$, and in turn, since $u$ is a strong 0-binary value
partition, that implies $\kappa(u) = m\sigma$. Finally, if the algorithm terminates after the loop and
returns 0, then precisely the same argument on the basis of ($\dagger$) implies $\kappa(u) = 0$. □


Lemma 9 For any $0 < \epsilon < \min_{i \in [k]} \sigma_i$, the algorithm in Figure 9a computes $\kappa_\epsilon$ such that
$\kappa_\epsilon - \kappa(u) < \epsilon$.

Proof: The arguments for the boundedness and non-emptiness of the polytope induced by
$\mathcal{L}_2(v)$ are precisely the same as for the polytope of $\mathcal{L}_1(m)$ studied in Lemma 6, and thus the
termination of the algorithm is straightforward. In what follows, we prove that the value
returned by the algorithm satisfies the claim of the lemma. Let $u$ be the given 0-binary
partition. Similarly to the proof of Lemma 6, first we prove a sub-claim that:

($\dagger$) For $v \in \mathbb{R}^{0+}$, if $x$ is a solution of $\mathcal{L}_2(v)$, then $x[\xi] \le b$ if and only if, for each cost
partition $c \in \mathcal{C}_p$, there exists a budget partition $b \in \mathcal{B}_p$ such that $(c, u, b)$ is an
abstraction for $\mathcal{M}_\Pi$ and $h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge v$.

The proof of ($\dagger$) mirrors the proof of the respective sub-claim in Lemma 6, mutatis mutandis,
and thus it is provided here only for ease of verification.

($\Leftarrow$) Assume to the contrary that, for each cost partition $c \in \mathcal{C}_p$, there exists a budget
partition $b \in \mathcal{B}_p$ with $h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge v$, and yet $x[\xi] > b$.

Given the values provided by $x$ to the cost variables $\bigcup_{o \in O} \{c[i](o)\}$, let $c$ be the corresponding
cost partition, and, for $i \in [k]$, let $\delta_i$ be the induced length of the shortest
path from $\alpha_i(s_0)$ to the $\sigma_i$-valued states in $G_i$. By our assumption, let $b$ be a budget
partition such that $h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge v$. First, by the definition of 0-binary value partitions,
$h_{\mathcal{M}_{(c,u,b)}}(s_0) \ge v$ implies that there exists $Z \subseteq [k]$, $\sum_{i \in Z} \sigma_i \ge v$, such that, for $i \in Z$, $b[i] \ge \delta_i$.
Second, constraint (11c), maximization of $\xi$, and the fact that the only bound on each $b[i]$
is by $\delta_i$, imply together that, for $i \in Z$, $x[b[i]] = \delta_i$. Putting things together, we obtain

$$b \;\overset{b \in \mathcal{B}_p}{\ge}\; \sum_{i \in Z} b[i] \;\ge\; \sum_{i \in Z} \delta_i \;=\; \sum_{i \in Z} x[b[i]] \;\overset{(11c)}{\ge}\; x[\xi],$$

contradicting our assumption.

(=>)  Assume  to  the  contrary  that,  x[£]  <  b,  and  yet  there  exists  a  cost  partition  c  £  Cp 
such  that,  for  all  budget  partitions  b  £  Bp  with  (c,  u,  b)  £  Ap,  we  have  hM( c,u,b)  (so)  <  v- 
Let  the  shortest  path  lengths  <5i, . . .  ,8k  be  defined  as  above,  but  now  with  respect  to 
the  specific  cost  partition  c  from  the  assumption.  Likewise,  let  xc  be  a  solution  to  C-2{p) 
with  an  extra  constraint  on  the  cost  variables  UoeO  {CU(°)}  to  be  assigned  to  c.  Since  the 
objective  in  C 2(11)  is  to  maximize  the  value  of  £,  we  have 

x[£]  >  Xc[£].  (24) 

Now,  let 

Z  =  argmax  >  8{. 

ieZ' 

zZiez'  °i>v 

Together,  constraint  (11c),  maximization  of  and  the  fact  that  the  only  bound  on  each 
b[i]  is  by  St  (via  the  cost  variables)  imply  that 

=  ^2di-  (25) 

i£Z  i£Z 

In  turn,  together  with  x[£]  <  b  and  Eq.  24,  Eq.  25  implies  that 

bw  =  |Xc[b|i11’  ;  6  z  , 

I  0,  otherwise 
is a budget partition with (c, u, b) ∈ Ap, and hM(c,u,b)(s0) > v, contradicting our
assumption.

This  finalizes  the  proof  of  the  sub-claim  (f).  Now,  consider  the  interval  end-points  a  and 
(3  at  the  termination  of  the  while-loop.  If  [3  =  £J*’  then  trivially  k(u)  <  (3.  Otherwise, 

if  P  <  Lie  (Jj,  then,  by  the  construction  of  the  algorithm,  at  some  iteration  of  the  while 
loop,  a  test  always-achievable(/3)  was  issued,  came  negative,  and  thus,  for  the  solutions  of 
C-2(/3),  we  have  x^[^]  >  b.  Hence,  by  (f),  tc(u)  <  (3.  Now,  if  a  /  0,  then,  by  the  construction 
of  the  algorithm,  at  some  iteration  of  the  while  loop,  a  test  always-achievable(a)  was  issued, 
came  positive,  and  thus,  for  the  solutions  xa  of  £2(0)1  we  have  xa[£]  <  b.  Hence,  by  (f), 
k(u)  >  a.  Putting  these  properties  on  a  and  [3  together  with  the  while-loop’s  termination 
condition  (3  —  a  <  e  implies  ne  —  k(u)  =  f3  —  k(u)  <  e.  Finally,  if  a  =  0,  then  e  <  minie[fe]  cr,; 
implies  (3  <  minlgrfc]  <jj.  In  turn,  since  k-(u)  corresponds  to  a  sum  of  values  of  some  states 
in  the  k  models  of  _A/^c’u'b\  k(u)  <  (3  concluded  above  implies  ne  =  k(u)  =  0.  □ 
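
The interval argument in the proof above can be illustrated with a generic bisection loop. This is only a sketch under stated assumptions: the algorithm of Figure 9a is not reproduced here, `always_achievable` is a hypothetical stand-in for the test issued inside the while-loop (assumed to be monotone in its argument), and the rule of returning 0 when the lower end-point never moves is inferred from the last step of the proof.

```python
from typing import Callable, Sequence


def approximate_kappa(
    sigmas: Sequence[float],
    epsilon: float,
    always_achievable: Callable[[float], bool],
) -> float:
    """Bisection over [0, sum(sigmas)] with a monotone yes/no oracle.

    `always_achievable(v)` is a hypothetical oracle assumed to answer
    True exactly when v <= kappa(u).
    """
    if not 0 < epsilon < min(sigmas):
        raise ValueError("epsilon must lie strictly between 0 and min(sigmas)")
    alpha, beta = 0.0, float(sum(sigmas))
    # Loop until the interval is shorter than epsilon.
    while beta - alpha >= epsilon:
        mid = (alpha + beta) / 2
        if always_achievable(mid):
            alpha = mid  # positive test: kappa(u) >= mid
        else:
            beta = mid   # negative test: kappa(u) < mid
    # If no test ever came back positive, the value must be 0.
    return 0.0 if alpha == 0.0 else beta
```

With an oracle that answers `v <= 3` and `sigmas = [2, 3]`, the returned value overestimates 3 by less than `epsilon`.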

Theorem  16  Let  n  =  ( V,so,u\0,c,b }  be  an  OSP  task,  C  be  a  set  of  pairwise  disjoint 
e-landmarks  for  n,  Icost  be  an  admissible  landmark  cost  function  from  C,  and  n£  be  the 
respective  budget  reducing  compilation  of  n.  For  every  n  for  n  with  Qb( n)  >  0,  there  is  a 
plan  7 r£  for  n£  with  Qbc(irc)  =  Q6!71");  and  vice  versa. 

Proof:  Let  7 T£  be  a  plan  for  n£,  and  let  7r  be  the  operator  sequence  obtained  by  replacing 
all  operators  o  from  IJILi  along  7 r£  with  the  respective  operators  o  £  O.  By  the 
definition  of  the  action  set  of  n£  in  Eq.  15,  we  have  n  applicable  in  so,  and  soM  = 
■so.d'Tcl  \  Ur=i  dorri{vjJt).  Thus,  Qb(7r)  =  Qbc(^c)-  Likewise,  by,  again,  the  definition  of 
the  action  set  of  n£  in  Eq.  15  and  the  fact  that  no  operator  in  Oc  achieves  the  control 
propositions  {(vl1/  1)  , . . . ,  (vLn/ 1)},  we  have  | H  7T£ |  <  1.  From  that,  we  have 

$$c(\pi) \leq c_{\mathcal{L}}(\pi_{\mathcal{L}}) + \sum_{i=1}^{n} lcost(L_i).$$

In  turn,  b  =  be  +  Li=i  IcostfLj)  by  Eq.  14,  and  cciFc)  £  b c  by  the  virtue  of  7 r£  being  a 
plan  for  n£.  Therefore,  it  holds  that  c(7r)  <  b,  and  thus  7r  is  a  plan  for  n. 

In  the  opposite  direction,  let  n  be  a  plan  for  n  with  Qb{ 7r)  >  0,  and  let  7r£  be  an 
operator  sequence  obtained  by  replacing,  for  each  e-landmark  L  £  £,  every  first  occurrence 
of  an  operator  from  L  with  the  respective  “cost  reduced”  operator  from  Ol .  It  is  easy  to 
verify  that  7T£  is  applicable  in  so£i  and  that  Qbc(nc)  =  Qb(j r)-  Likewise,  by  the  definition 
of  e-landmarks,  every  L  £  C  will  have  a  presence  along  7r.  From  that,  we  have 

$$c_{\mathcal{L}}(\pi_{\mathcal{L}}) = c(\pi) - \sum_{i=1}^{n} lcost(L_i) \leq b - \sum_{i=1}^{n} lcost(L_i) = b_{\mathcal{L}},$$

where  the  first  equality  is  by  pairwise  disjointness  of  {Li, . . .  ,Ln},  the  inequality  is  by  7r 
being  a  plan  for  n,  and  the  second  equality  is  by  Eq.  14.  Thus,  7 r£  is  a  plan  for  n£.  □ 

Theorem  17  Let  n  =  ( V,so,u]0,c,b )  be  an  OSP  task,  C  =  {L\, . . . ,  Ln}  be  a  set  of  e- 
landmarks  for  n,  Icost  be  an  admissible  landmark  cost  function  from  C,  and  n£*  be  the 
(generalized)  budget  reducing  compilation  of  n.  For  every  n  for  n  with  Qb{ir)  >  0,  there  is 
a  plan  1 r£»  for  n£*  with  Qbc*  (ire*)  =  Qb(ft),  and  vice  versa. 
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Proof:  Let  7r£*  be  a  plan  for  Il£* ,  and  let  n  be  the  operator  sequence  obtained  by  (i)  replacing 
all  operators  o  with  the  respective  operators  oGO,  and  (ii)  removal  of  all  get  operators.  By 
Eq.  17,  we  have  n  applicable  in  sq>  and  soM  =  so£*[I'7r£*J  \  {(^Li/l) ,  •  •  • ,  (vl„/ !)}•  Thus, 
Qb( 7r)  =  Qbc*  (7T£*).  Now,  for  each  e-landmark  Lg£,  let  f(L)  be  the  number  of  instances 
of  the  cost  reduced  counterparts  o  of  the  operators  from  L  along  7 r£*.  By  Eqs.  17  and  18, 
for  each  Lg£,  nc*  must  contain  at  least  £(L)  —  1  instances  of  operator  get(L).  From  that, 
we  have 


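$$
\begin{aligned}
c(\pi) &\leq c_{\mathcal{L}^*}(\pi_{\mathcal{L}^*}) + \sum_{o \in \pi_{\mathcal{L}^*}} \sum_{L \in \mathcal{L}(o)} lcost(L) - \sum_{L \in \mathcal{L}} (\ell(L) - 1) \cdot lcost(L) \\
&= c_{\mathcal{L}^*}(\pi_{\mathcal{L}^*}) + \sum_{L \in \mathcal{L}} \ell(L) \cdot lcost(L) - \sum_{L \in \mathcal{L}} (\ell(L) - 1) \cdot lcost(L) \\
&= c_{\mathcal{L}^*}(\pi_{\mathcal{L}^*}) + \sum_{L \in \mathcal{L}} lcost(L) \\
&\leq b_{\mathcal{L}^*} + \sum_{L \in \mathcal{L}} lcost(L) \\
&= b,
\end{aligned}
$$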

and  thus  7r  is  a  plan  for  II. 

In  the  opposite  direction,  let  7r  =  (o\, . . .  ,om)  be  a  plan  for  II  with  Qb( n)  >  0.  By 
the  dehnition  of  e- landmarks,  every  L  £  C  will  have  a  presence  along  ir,  and  let  Of 
1  <  f(i)  <  n,  be  the  first  occurrence  of  an  operator  from  Lj  along  it.  Since  e-landmarks 
in  jC  are  not  necessarily  disjoint,  we  may  have  f{i)  =  / (j )  for  some  pairs  1  <  i  7^  j  <  n. 
Let  p  =  (o(!j, . . . ,  O(fcj),  k  <  n,  be  a  sequence  of  operators  obtained  from  ,  oj(n)} 

by  (a)  duplicate  elimination,  and  (b)  ordering  consistently  with  it.  Let  7 r£*  be  an  operator 
sequence  obtained  from  7r  based  on  p  by 

(1)  replacing  each  ou\  with  o^,  and 

(2)  inserting  right  before  each  do)  an  arbitrary  ordered  sequence  of  actions 

i- 1 

\J{get{L)  |  L  €  C,  {o(j),  o(i)}  C  L}. 
i=l 

It  is  not  hard  to  verify  that  7 r£*  is  applicable  in  -so£*,  and  that  Qbc*  (ttc*)  =  Qb( it).  Now, 
step  (1)  of  expanding  ir  to  7 r£»  (over  all  1  <  i  <  k)  reduces  the  cost  of  the  operator  sequence 
by 

k 

E  E  lcost(L)  =  n(L)-lcost(L), 
i=  1  Le£(ow) 

where  p{L)  =  \L  n  p\.  Step  (2)  of  expanding  7 r  to  7r£»  increases  the  cost  of  the  operator 
sequence  by  Y^Lee  (m(^)  —  l)-lcost(L).  Thus, 

$$c_{\mathcal{L}^*}(\pi_{\mathcal{L}^*}) = c(\pi) - \sum_{L \in \mathcal{L}} lcost(L) \leq b - \sum_{L \in \mathcal{L}} lcost(L) = b_{\mathcal{L}^*},$$

that is, $\pi_{\mathcal{L}^*}$ is a plan for $\Pi_{\mathcal{L}^*}$. □


Lemma  19  Let  II  be  an  OSP  task,  II(£)g  )  be  a  (e,  S ref) -compilation  of  II,  C  be  a  set  of 
landmarks  for  H(£,Sref),  Icost  be  an  admissible  landmark  cost  function  for  C,  and  He*  be  the 
respective  budget  reducing  compilation  of  (. C ,  Icost )  into  II.  Let  tt\  and  7t2  be  a  pair  of  plans 
for  Il£*  with  end- states  si  and  S2,  respectively,  such  that  sjl  =  and 

C£*(7Ti)+  ^  lcost(L)  >  C£*( 7T2)  +  Y,  lcost(L)  (20) 

L:(vL/0)esi  L:(vl/0)GS2 

Then,  for  any  plan  tt\  that  extends  7Ti,  there  exists  a  plan  tt2  that  extends  7t2  such  that 
Qbc*  (tt^)  =  Qbc*( 7ri). 

Proof:  Under  the  notation  in  the  claim,  the  proof  is  by  a  constructive  mapping  of  the  plan 
ir[  to  the  corresponding  plan  vr^. 

First,  we  derive  from  a  plan  p \  for  II  by  (i)  removing  the  finish  operator  and  all 
the  get(-)  operators,  and  (ii)  replacing  all  instances  of  each  discounted  operator  o  with 
instances  of  the  respective  original  operator  o.  This  results  in  a  plan  p\  :=  p\  ■  p\e  for  II 
with  s0 [pi ]  =  s{^  and  c{p\)  =  cp*( 7Ti)  +  J2l-(vl/o)gsi  lcost(L).  To  see  the  latter,  for  each 
operator  u  G  Op*,  let  k(ui)  >  0  denote  the  number  of  instances  of  u  along  7r i  .  Given  that, 
we  have 


c(pi)  =  C£*(7Ti)  -  E  n(get(L))lcost(L )  +  Y^  lcost(L)  Y^  k{o) 

LdC  L(E.jC  o:Z/G>C(o) 


C£*(  7Tl)  +  E  lcost(L) 


Lee 


Y  K(°)  I  “  K(9et(L)) 

,o-.LeC{o) 


(26) 


C£*(7Tl)  +  Y  lcOSt(L)  ■ 


esi 


Lee 

=  C£*(7Ti)+  Y,  lcost(L), 
L:(vL/0)esi 


where  the  second  and  fourth  equalities  are  just  formula  manipulations,  the  first  equality  is 
direct  from  the  construction  of  p\ ,  and  the  third  equality  is  by  the  definition  of  the  budget 
reducing  compilation,  and  specifically,  by  Eqs.  17  and  18. 

Similarly  to  the  construction  of  p\  from  7Ti,  we  can  construct  p2  from  n2,  with  and 
so[P2]  =  4V  and  C(P2)  =  C£*(tt2)  +  Y,L:{vL/0)es2  lcost(L) ■  Thus>  by  Eq.  20,  c(pi)  >  c{p2), 
and  also,  by  the  setting  of  the  claim,  sqIpi]  =  so[p2].  Hence,  p2  =  P2  •  Pie  is  also  a  plan  for 
n,  and  Qb(p[)  =  Qb(p'2)- 

As  the  last  step,  we  now  construct  from  p2  a  plan  tt2  for  n^*  as  in  the  claim.  First,  by 
the  properties  of  7t2  in  the  claim,  the  plan  p2  for  n  achieves  all  the  landmarks  £^S2  =  {L  \ 
(vl/ 0)  G  s2}.  Second,  by  the  definition  of  the  landmark  set  C ,  p\e  must  satisfy  the  rest  of 
the  landmarks,  that  is,  C&S2  =  {L  \  (vl/1)  G  s2}.  Let  us  denote  the  operator  instances  along 
pie  as  (oi, . . . ,  Ofc),  k  =  | pie |,  and  let  {£i, . . . ,  E/J  be  a  partition  of  £gS2  with  £*  C  £es2 
being  the  subset  of  all  landmarks  from  £gS2  for  which  o\  is  their  first  achiever  along  p\e. 
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Given  that,  consider  an  operator  sequence  7T2e  :=  ir^,  recursively  defined  via  tt-0)  =  e, 
and,  if  Ci  =  0,  then  7 =  7r^_1)  •  (07),  else  7rW  =  7r^-1^  -7-  (of),  where  7  is  some  (arbitrary) 
sequencing  of  operators 

{get(L)  |  L  <5  £*  A  (ul/0)  <5  sol^l  tt7r(i_1)]]}- 


Finally,  we  set  iri2  :=  tt-2  •  ir2e- 

By  Eqs.  17  and  18  in  the  definition  of  the  budget  reducing  compilation,  it  is  easy  to 
verify  that  the  above  construction  of  7T2e  ensures  cc*{^2e)  =  c{pie )  —  Yh(VL/\ )es2^cos^^^ 
and  Qbc*  (tt 2e )  =  Qb(P2)-  In  turn,  by  the  properties  of  772,  that  implies  Qbc*  ( tt'2 )  =  Qbc*  (71^) 
and  cc*{tt'2)  =  C£*(7r2)  +  C£*(7r2e). 

Finally,  since 

C£*(vr2)  =  c(p2)  -  ^2  lcost(L) 

(vL/0)es2 


and 


we  have 


C£*(vr2e)  =  c(pie)  -  ^  lcost{L ), 

{vL/l)&S2 

Cc*{ A)  =  c(P2)  +  c(pie)  -  ^  lcO-St(L). 

Lee 


Thus,  since  c{p\)  >  c{p2)  and  p\  =  p\  ■  p\e  is  a  valid  plan  for  II,  we  have 


C£*( 4)  <  c(pi)  +c(pie)  -  £  lcost(L) 

Lee 

<  cW)  -  £  lcost(L) 

Lee 


<  b  —  ^2  lcost(L), 
Lee 


finalizing  the  proof  that  tt'2  is  a  plan  for  Il£*  as  in  the  claim. 


□ 
Appendix B. Detailed Evaluation Results


[Figure legend (planning domains): airport, blocks, depot, driverlog, freecell, grid,
gripper, logistics, miconic, mystery, openstacks, pipesworld, psr-small, tpp, trucks,
rovers, satellite, zenotravel. Panels: (a), (b).]

Figure  15:  Restriction  of  the  presentation  in  Figure  7,  p.  19,  to  the  tasks  budgeted  with 
25%  of  the  minimal  cost  required  to  achieve  the  entire  set  of  sub-goals 

Figure 16: Restriction of the presentation in Figure 7, p. 19, to the tasks budgeted with
50% of the minimal cost required to achieve the entire set of sub-goals

Figure 17: Restriction of the presentation in Figure 7, p. 19, to the tasks budgeted with
75% of the minimal cost required to achieve the entire set of sub-goals

Figure 18: Restriction of the presentation in Figure 7, p. 19, to the tasks budgeted with
100% of the minimal cost required to achieve the entire set of sub-goals

Figure 19: Restriction of the presentation in Figure 13, p. 41, to the tasks budgeted with
25% of the minimal cost required to achieve the entire set of sub-goals

Figure 20: Restriction of the presentation in Figure 13, p. 41, to the tasks budgeted with
50% of the minimal cost required to achieve the entire set of sub-goals




[Figure legend (planning domains): airport, blocks, depot, driverlog, freecell, grid,
gripper, logistics, miconic, mystery, openstacks, pipesworld, psr-small, tpp, trucks,
rovers, satellite, zenotravel. Panels: (a) blind, (b) hM; search: BFBB.]

Figure  21:  Restriction  of  the  presentation  in  Figure  13,  p.  41,  to  the  tasks  budgeted  with 
75%  of  the  minimal  cost  required  to  achieve  the  entire  set  of  sub-goals 

Figure 22: Restriction of the presentation in Figure 13, p. 41, to the tasks budgeted with
100% of the minimal cost required to achieve the entire set of sub-goals

Figure 23: Restriction of the presentation in Figure 14, p. 42, to the tasks budgeted with
25% of the minimal cost required to achieve the entire set of sub-goals

Figure 24: Restriction of the presentation in Figure 14, p. 42, to the tasks budgeted with
50% of the minimal cost required to achieve the entire set of sub-goals

Figure 25: Restriction of the presentation in Figure 14, p. 42, to the tasks budgeted with
75% of the minimal cost required to achieve the entire set of sub-goals

Figure 26: Restriction of the presentation in Figure 14, p. 42, to the tasks budgeted with
100% of the minimal cost required to achieve the entire set of sub-goals

Abstract

Achieving joint objectives in distributed domain-independent planning
problems by teams of cooperative agents requires significant coordination and
communication efforts. For systems facing a plan failure in a dynamic
environment, arguably, attempts to repair the failed plan in general, and
especially in the worst-case scenarios, do not straightforwardly bring any benefit
in terms of time complexity. However, in multi-agent settings, the
communication complexity might be of much higher importance; a high
communication overhead might even be prohibitive in certain domains. We
hypothesize that in decentralized systems, where frequent coordination is
required to achieve joint objectives, attempts to repair failed multi-agent plans
should lead to lower communication overhead than replanning from scratch.

Here, we formally introduce the multi-agent plan repair problem. Building
upon the formal treatment, we present the core hypothesis underlying our
work and subsequently describe three algorithms for multi-agent plan repair
reducing the problem to specialized instances of the multi-agent planning
problem. Finally, we present an experimental validation, the results of which
confirm the core hypothesis of the paper. Our rigorous treatment of the
problem and experimental results pave the way for further analytical
as well as algorithmic investigations of the problem.

Keywords:  Multi-agent  plan  repair,  Decentralized  multi-agent  planning, 
Communication  complexity,  Experimental  evaluation 
1. Motivation

Multi-agent planning based on classical planning is an approach to
constructing control mechanisms for a team of possibly heterogeneous autonomous
agents which compute and subsequently execute plans for the individual
agents so as to achieve some joint team objective in the environment. When
an agent is situated in a dynamic environment, occurrence of various
unexpected events in the environment might lead to invalidation of the plan, a
failure. A straightforward solution to this problem is to invoke a planning
algorithm and compute a new plan from the state the failure occurred in to
a state conforming with its original objective.

In general, replanning in the case of a failure occurrence is a costly
procedure, especially in terms of its time complexity. In many cases, however, a
relatively minor fix to the original plan would resolve the failure, possibly at
a lower cost. Because it is not clear exactly which planning domains
and types of dynamic environments would benefit from such a repair
approach, it can be argued that non-informed plan repair attempts can in
many cases even raise the overall complexity of the approach in comparison
to replanning. This would be due to futile attempts to repair the failed plan
before inevitably falling back to replanning.

Plan repair can be seen as planning with re-use of fragments of the old
plan. Even though a number of works empirically demonstrate
that plan repair in some domains performs better than replanning (e.g., [1, 2,
3]), Nebel and Koehler in [4] theoretically analyzed plan re-use (plan repair),
and concluded that in general it does not bring any benefit over replanning
in terms of computational time complexity. Taking into account the
performance of modern classical planners on modern hardware, the benefit of
repairing failed plans would often be relatively low from the time-complexity
perspective.

In situated multi-agent systems, however, the time complexity is often not
of primary importance. Consider application domains such as
undersea operations by teams of coordinated autonomous underwater vehicles.
While state-of-the-art technology allows relatively powerful
computers to be employed on board such robots, the communication links are extremely
constrained and expensive; wireless networks cannot be deployed and
communication is performed mostly using acoustic signaling. In such applications,
it is the communication complexity of the distributed planning algorithms
which matters more than the time complexity. Consequently, employment
of multi-agent plan repair techniques can provide a tangible benefit over
replanning for a team of robots whose multi-agent plan fails.

Studies of multi-agent planning systems in previous literature [5, 6, 7] reveal
that local planning for individual agents usually has lower
computational complexity than solving the global coordination problem. Therefore,
at least intuitively, plan re-use techniques, which effectively simplify the
coordination part of the problem, should improve the time complexity and,
as the communication complexity is often tightly related to the time complexity,
consequently lower the communication complexity as well.

The motivation for our research is the intuition that multi-agent plan
repair, even though not always the fastest approach, should under specific
conditions generate lower communication overhead in comparison to replanning.
The conditions correspond to the amount and frequency of minimal required
coordination and the types of failures the environment generates. While the
hypothesis is rather intuitive, investigation of the particular types of domains
and the corresponding suitable repair algorithms deserves deeper attention.

The contribution of the presented paper is threefold. Firstly, after
introducing the general problem of multi-agent planning stemming from the
formulation in [6], we give a rigorous treatment of the problem of
multi-agent plan repair and formulate a notion of relative coordination frequency
of a multi-agent planning problem. In turn, this formal approach allows
us to state the core hypothesis of the presented research in a more formal
and precise manner. Secondly, we propose three decentralized algorithms for
multi-agent plan repair reducing the problem to specialized instances of the
multi-agent planning problem, including proofs of their correctness. Finally,
we present an experimental validation confirming the core hypothesis of the
paper. The paper concludes with final remarks regarding the shortcomings of
our approach and future directions in the line of research described here.

2.  Multi-agent  Planning 

We treat the problem of multi-agent planning as an extension of
classical single-agent planning in the manner adapted from MA-Strips planning
in [6]. We consider a number of cooperative and coordinated actors featuring
distinct sets of capabilities (actions), which concurrently plan and
subsequently execute their local plans so as to achieve a joint goal. An instance
of a multi-agent planning problem is defined by i) an environment
characterized by a state space, ii) a finite set of agents, each characterized by a set of
primitive actions (or capabilities) it can execute in the environment, iii) an
initial state the agents start their activities in and iv) a characterization of
the desired goal states. The following formal restatement of the MA-Strips
problem and our adaptations thereof constitute the preliminaries enabling
us to state the core hypotheses of the paper in a formal manner, as well as
provide the necessary background for the algorithms and their proofs
introduced later in Section 3. The formal preliminaries build upon the original
algorithm for the MA-Strips problem proposed by Brafman and Domshlak
in [6].

A state $s \subseteq \mathcal{L}$ is a set of atoms from a finite set of propositions
$\mathcal{L} = \{p_1, \ldots, p_m\}$. Given $p \in s$, we say that $p$ holds in $s$, otherwise $p$ does not hold
in $s$. In that sense, states are complete. That means, it cannot happen that
there is a $p \in \mathcal{L}$ such that $p$'s validity in $s$ is unknown. $S = 2^{\mathcal{L}} \cup \{x\}$ denotes
the set of all states together with a distinguished state $x \in S$ denoting an
undefined state.

A primitive action (action) an agent can perform in an environment
is a tuple $a = (pre(a), add(a), del(a))$, where $a$ is a unique action label and
$pre(a)$, $add(a)$, $del(a)$ respectively denote the sets of preconditions, add effects
and delete effects of $a$ taken from some $\mathcal{L} = \{p_1, \ldots, p_m\}$. $Act$ denotes the
set of all actions and we furthermore assume there is a distinguished empty
action $\epsilon = (\emptyset, \emptyset, \emptyset) \in Act$ with no preconditions and no effects. Whenever
$pre(a), add(a), del(a) \subseteq \mathcal{L}$, we say that $a$ is defined over $\mathcal{L}$.

We say that an action $a$ is applicable in a state $s$ iff $pre(a) \subseteq s$. An
application of $a$ is defined by the state transformation operator $\oplus : S \times Act \to S$
so that $s \oplus a = (s \cup add(a)) \setminus del(a)$ iff $a$ is applicable in $s$. In the case
$a$ is not applicable in $s$, $s \oplus a$ results in the distinguished undefined state $x$.
Note, we do not require that $add(a) \cap del(a) = \emptyset$, rather we simply assume
that the effects negate each other strictly according to the definition of $\oplus$.
Furthermore, $\oplus$ is left-associative, hence we can write $s \oplus a_1 \oplus \cdots \oplus a_n$.
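
The state transformation operator can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Action`, `apply` and `apply_all` are hypothetical, `None` stands in for the undefined state, and the assumption that the undefined state stays undefined under any further action is ours.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional

# None plays the role of the distinguished undefined state.
State = Optional[FrozenSet[str]]


@dataclass(frozen=True)
class Action:
    label: str
    pre: FrozenSet[str] = frozenset()
    add: FrozenSet[str] = frozenset()
    delete: FrozenSet[str] = frozenset()


def apply(state: State, action: Action) -> State:
    """s (+) a = (s | add(a)) - del(a) if pre(a) is a subset of s, else undefined."""
    if state is None or not action.pre <= state:
        return None
    # Deletes are applied after adds, so an atom that is both added
    # and deleted does not hold in the resulting state.
    return (state | action.add) - action.delete


def apply_all(state: State, actions: Iterable[Action]) -> State:
    """Left-associative application: ((s (+) a1) (+) a2) (+) ..."""
    for action in actions:
        state = apply(state, action)
    return state
```

Note that an action with overlapping add and delete effects is legal here; the order of the set operations decides the outcome, exactly as in the definition.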

An agent $\alpha = \{a_1, \ldots, a_n\}$ is characterized precisely by its capabilities, a
finite repertoire of actions $a_i \in Act$ it can perform in the environment.

Definition 1 (MA-Strips). A multi-agent planning problem is a quadruple
$\Pi = (\mathcal{L}, A, s_0, S_g)$, where

1. $\mathcal{L}$ is a finite set of atoms;

2. $A$ is a set of agents $\alpha_1, \ldots, \alpha_n$ with actions defined over $\mathcal{L}$, featuring,
besides the empty action $\epsilon$, otherwise mutually disjoint sets of actions.
That is, $\alpha_i \cap \alpha_j = \{\epsilon\}$, whenever $i \neq j$;

3. $s_0 \in S$ is an initial state; and finally

4. $S_g \subseteq S$ is a set of goal states.

From now on, given a set of agents $A$ as defined above, $Act = \bigcup_{i=1}^{n} \alpha_i$ denotes
the set of all actions which can be performed among the agents of the team
$A$, the team capabilities.

The definition of a MA-Strips multi-agent planning problem is adapted
from the original formulation in [6]. Before formally defining the notion of
a solution to a multi-agent planning problem, we first introduce a sequence of
auxiliary notions.

Given an agent $\alpha$, a single-agent plan $P$ is a sequence of actions $a_1, \ldots, a_k$,
s.t. $a_i \in \alpha$ for every $i$. $P[i]$ denotes the $i$-th action in $P$, or $P[i] = \epsilon$ in the
case $i$ is larger than the length of $P$, which in turn will be denoted $|P|$.

A team of agents $A = \alpha_1, \ldots, \alpha_n$ can act in the environment concurrently.
A joint action $\mathbf{a} = (pre(\mathbf{a}), add(\mathbf{a}), del(\mathbf{a}))$ of the team is specified by
$\mathbf{a} = (a_1, \ldots, a_n)$, a tuple of actions corresponding to the individual agents, that
is, $a_i \in \alpha_i$ for each $i$, its preconditions $pre(\mathbf{a}) = \bigcup_{i=1}^{n} pre(a_i)$ and its effects
$add(\mathbf{a}) = \bigcup_{i=1}^{n} add(a_i)$ and $del(\mathbf{a}) = \bigcup_{i=1}^{n} del(a_i)$. $\mathbf{a}[k]$ denotes the $k$-th action
of $\mathbf{a}$. The notions of action applicability in a state $s$, as well as application
of $\mathbf{a}$ to $s$, straightforwardly extend from the definitions for primitive actions,
hence we can write $s \oplus \mathbf{a}$. Note, as we are building on top of the MA-Strips
formalism, at this point we do not rule out, nor specifically handle,
joint actions in which the effects of individual agents' actions cancel each
other out. In general, however, such considerations need to be tackled. Later on
in this section, we comment on such joint actions in more detail.

Definition 2 (multi-agent plan). Let $\Pi = (\mathcal{L}, A, s_0, S_g)$ be a multi-agent
planning problem with $A = \alpha_1, \ldots, \alpha_n$. A synchronous multi-agent plan
$\mathcal{P} = \{P_1, \ldots, P_n\}$, consisting of single-agent plans $P_1, \ldots, P_n$ respectively
constructed from actions of the agents $\alpha_1, \ldots, \alpha_n$, is a solution to $\Pi$ if the
plan $\mathcal{P}$ satisfies the following:

1. $\mathcal{P}$ is well-formed, i.e., $|P_i| = |P_j|$ for all $i, j \leq n$. Additionally, $|\mathcal{P}| = |P_k|$,
for every $k \leq n$, denotes the length of the multi-agent plan $\mathcal{P}$;

2. $\mathcal{P}$ is feasible, i.e., there exists a sequence of states $s_1, \ldots, s_m$, s.t. $m = |\mathcal{P}|$
and $s_{i+1} = s_i \oplus \mathbf{a}_i$ with $\mathbf{a}_i = (P_1[i], \ldots, P_n[i])$ for all $i < m$; and finally

3. $\mathcal{P}$ reaches the goal $S_g$, i.e., $s_m \in S_g$.

We also say that $\mathcal{P}$ solves the problem $\Pi$. Finally, $Plans(\Pi)$ denotes the
set of plans which are solutions to a given multi-agent planning problem $\Pi$.
Additionally, $\mathcal{P}[k]$ denotes the joint action of the team in the step $k$ and
$\mathcal{P}[k, i]$ denotes the primitive action of the agent $i$ in the step $k$.
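
The three conditions of the definition can be checked mechanically. The following is a minimal sketch under stated assumptions: `Action`, `joint`, `apply` and `solves` are hypothetical names, the goal set is passed as a predicate `is_goal` rather than as an explicit set of states, and `None` stands in for the undefined state.

```python
from typing import Callable, FrozenSet, NamedTuple, Optional, Sequence


class Action(NamedTuple):
    label: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


EMPTY = Action("eps", frozenset(), frozenset(), frozenset())


def joint(actions: Sequence[Action]) -> Action:
    """Joint action: unions of preconditions, add effects and delete effects."""
    return Action(
        "+".join(a.label for a in actions),
        frozenset().union(*(a.pre for a in actions)),
        frozenset().union(*(a.add for a in actions)),
        frozenset().union(*(a.delete for a in actions)),
    )


def apply(state: Optional[FrozenSet[str]], action: Action):
    """State transformation; None stands in for the undefined state."""
    if state is None or not action.pre <= state:
        return None
    return (state | action.add) - action.delete


def solves(
    plans: Sequence[Sequence[Action]],
    s0: FrozenSet[str],
    is_goal: Callable[[FrozenSet[str]], bool],
) -> bool:
    """Check well-formedness, feasibility and goal reachability."""
    if len({len(p) for p in plans}) != 1:   # condition 1: equal lengths
        return False
    state = s0
    for step in zip(*plans):                # one action per agent in each step
        state = apply(state, joint(step))
        if state is None:                   # condition 2: every step applicable
            return False
    return is_goal(state)                   # condition 3: final state is a goal
```

Shorter single-agent plans can be padded with `EMPTY` to make the multi-agent plan well-formed.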

This notation allows us to introduce the following plan-matrix notation
for a multi-agent plan $\mathcal{P}$, with $a_{ij} = \mathcal{P}[i, j]$, providing a more visual
understanding of multi-agent plans.
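
Assuming that rows index the plan steps and columns index the agents, which is consistent with $\mathcal{P}[k, i]$ denoting the action of agent $i$ in step $k$, the plan matrix of a plan of length $m$ over $n$ agents can be written as follows (a sketch of the layout, with $m$ introduced here for the plan length):

```latex
\mathcal{P} =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix},
\qquad a_{ij} = \mathcal{P}[i, j]
```

Each row is then a joint action of the team, and each column is the single-agent plan of one agent.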

We say that two multi-agent plans $\mathcal{P}_1$, $\mathcal{P}_2$ are equal ($\mathcal{P}_1 = \mathcal{P}_2$) iff they have
the same length ($|\mathcal{P}_1| = |\mathcal{P}_2|$) and for all $i$ and $j$ we have $\mathcal{P}_1[i, j] = \mathcal{P}_2[i, j]$.

A concatenation of two multi-agent plans $\mathcal{P}_1$ and $\mathcal{P}_2$ over the same agents
$\alpha_1, \ldots, \alpha_n$ is defined as a plan $\mathcal{P} = \mathcal{P}_1 \cdot \mathcal{P}_2$, where for each $i$ and $j$ we
have $\mathcal{P}[i, j] = \mathcal{P}_1[i, j]$ if $i \leq |\mathcal{P}_1|$ and $\mathcal{P}[i, j] = \mathcal{P}_2[i - |\mathcal{P}_1|, j]$ for $i > |\mathcal{P}_1|$.
Note, concatenation of multi-agent plans is an associative
operation, so we can write $\mathcal{P} = \mathcal{P}_1 \cdot \mathcal{P}_2 \cdots \mathcal{P}_n$.

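Given a multi-agent plan $\mathcal{P}$, $\mathcal{P}[i..j]$ denotes a fragment of $\mathcal{P}$ from the step
$i$ to the step $j$. More precisely, $\mathcal{P}[i..j]$ is a fragment of $\mathcal{P}$ iff there exist
multi-agent plans $\mathcal{P}_{prefix}$ and $\mathcal{P}_{suffix}$ such that $\mathcal{P}_{prefix} \cdot \mathcal{P}[i..j] \cdot \mathcal{P}_{suffix} = \mathcal{P}$. Finally,
$\mathcal{P}[i..\infty]$ denotes the $i$-th suffix of the plan $\mathcal{P}$, that is, $\mathcal{P}[i..\infty] = \mathcal{P}[i..|\mathcal{P}|]$.
$\mathcal{P}_1 \cdot \mathcal{P}_2$ is said to be a decomposition of a multi-agent plan $\mathcal{P}$ iff $\mathcal{P} = \mathcal{P}_1 \cdot \mathcal{P}_2$.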
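
Concatenation, fragments and suffixes map directly onto list operations if a multi-agent plan is stored as a list of joint steps. The sketch below assumes this representation (one tuple of action labels per step) and uses hypothetical helper names; step indices are 1-based and inclusive, as in the text.

```python
from typing import List, Tuple

JointStep = Tuple[str, ...]   # one action label per agent
Plan = List[JointStep]        # element k-1 holds the joint action of step k


def concat(p1: Plan, p2: Plan) -> Plan:
    """P1 . P2: the steps of P1 followed by the steps of P2."""
    return p1 + p2


def fragment(plan: Plan, i: int, j: int) -> Plan:
    """P[i..j] with 1-based, inclusive step indices."""
    return plan[i - 1:j]


def suffix(plan: Plan, i: int) -> Plan:
    """P[i..inf], i.e. P[i..|P|]."""
    return fragment(plan, i, len(plan))


def is_decomposition(p1: Plan, p2: Plan, plan: Plan) -> bool:
    """P1 . P2 is a decomposition of P iff P = P1 . P2."""
    return concat(p1, p2) == plan
```

For any step index `i`, `fragment(plan, 1, i)` and `suffix(plan, i + 1)` form a decomposition of `plan`.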

Given two multi-agent plans $\mathcal{P}_1$ and $\mathcal{P}_2$ we can define how different they
are. $\mathit{diff}(\mathcal{P}_1, \mathcal{P}_2)$ denotes the difference between $\mathcal{P}_1$ and $\mathcal{P}_2$, that is, the
overall number of primitive actions in $\mathcal{P}_1$ which do not correlate with the
corresponding primitive actions in $\mathcal{P}_2$ and vice versa. $\mathit{diff}(\mathcal{P}_1, \mathcal{P}_2)$ corresponds
to the Levenshtein distance [8], in the literature also referred to as edit distance,
between two strings corresponding to the sequences of actions of the individual
plans. The adaptation of the Levenshtein distance to two multi-agent plans
counts the atomic edits needed to transform one plan into the other, namely
insertion of an empty joint action, deletion of an empty joint action, and
replacement of an individual action. The cost of all atomic edits is assumed
to be equal. This model of plan difference is also
closely related to the MODDELINS modification problem for single-agent
plans described in [4].
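
One possible way to compute such a difference is a row-alignment dynamic program in the style of the Levenshtein recurrence. The sketch below is an illustration under stated assumptions, not the procedure used by the authors: every atomic edit costs 1, the marker `EPS` for the empty action is hypothetical, and inserting (or deleting) a non-empty joint action is charged as one empty-row insertion (deletion) plus one replacement per non-empty entry.

```python
EPS = "eps"  # hypothetical marker for the empty action


def hamming(row_a, row_b):
    """Number of agents whose actions differ between two joint actions."""
    return sum(x != y for x, y in zip(row_a, row_b))


def row_cost(row):
    """Edits to create or erase a joint action: one empty-row insertion or
    deletion plus one replacement per non-empty primitive action."""
    return 1 + sum(a != EPS for a in row)


def plan_diff(p1, p2):
    """Cost of the cheapest row alignment of two plans over the same agents."""
    m, n = len(p1), len(p2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + row_cost(p1[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + row_cost(p2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + hamming(p1[i - 1], p2[j - 1]),  # replacements
                d[i - 1][j] + row_cost(p1[i - 1]),                # erase a step
                d[i][j - 1] + row_cost(p2[j - 1]),                # create a step
            )
    return d[m][n]


plan_a = [("a", "b"), ("c", EPS)]
plan_b = [("a", "b"), ("c", "d")]
plan_c = [("a", "b")]
```

Here `plan_a` and `plan_b` differ by a single action replacement, while turning `plan_a` into `plan_c` requires replacing `c` by the empty action and then deleting the resulting empty joint action.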

To introduce the MA-Plan algorithm for solving MA-Strips problems
as formulated in [6], we finally need to distinguish between the public and
private actions of individual agents. An action is public whenever its
preconditions or effects involve atoms occurring in preconditions or effects of an
action belonging to another agent of the team. The private actions are those
which neither affect nor are affected by actions of the other agents.

Let $\mathit{atoms}(a) = \mathit{pre}(a) \cup \mathit{add}(a) \cup \mathit{del}(a)$ and similarly $\mathit{atoms}(\alpha)$ be the
sets of atoms required or affected by the action $a$ and the agent $\alpha$
respectively. Given a multi-agent team $\mathcal{A} = \alpha_1, \dots, \alpha_n$ with actions defined over
the set of atoms $\mathcal{L}$, the set of public actions is defined as
$\mathit{Act}_{pub} = \{a \mid \exists i \neq j : a \in \alpha_i \text{ and } \mathit{atoms}(a) \not\subseteq \mathcal{L} \setminus \mathit{atoms}(\alpha_j)\}$.
Consequently, the set of private actions
is defined as $\mathit{Act}_{priv} = \mathit{Act} \setminus \mathit{Act}_{pub}$.
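
A minimal sketch of this classification follows. It assumes each agent is given as a list of STRIPS-style actions and treats an action as public iff it shares at least one atom with some other agent, which is how the definition above is read here; the `Action` class, the helper names and the toy logistics domain are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Action:
    name: str
    pre: FrozenSet[str] = frozenset()
    add: FrozenSet[str] = frozenset()
    delete: FrozenSet[str] = frozenset()

    def atoms(self) -> FrozenSet[str]:
        """atoms(a) = pre(a) | add(a) | del(a)."""
        return self.pre | self.add | self.delete


def agent_atoms(actions: List[Action]) -> Set[str]:
    """atoms(alpha): all atoms required or affected by the agent's actions."""
    return set().union(*(a.atoms() for a in actions))


def split_public_private(agents: Dict[str, List[Action]]) -> Tuple[Set[str], Set[str]]:
    """Return the names of public and private actions of the whole team."""
    public, private = set(), set()
    for name, actions in agents.items():
        others: Set[str] = set()
        for other, other_actions in agents.items():
            if other != name:
                others |= agent_atoms(other_actions)
        for a in actions:
            (public if a.atoms() & others else private).add(a.name)
    return public, private


# Hypothetical domain: a crane places packages, a truck loads and drives.
team = {
    "truck": [
        Action("drive", frozenset({"truck_at_A"}), frozenset({"truck_at_B"}),
               frozenset({"truck_at_A"})),
        Action("load", frozenset({"pkg_at_A"}), frozenset({"pkg_in_truck"}),
               frozenset({"pkg_at_A"})),
    ],
    "crane": [Action("place", frozenset(), frozenset({"pkg_at_A"}))],
}
public_actions, private_actions = split_public_private(team)
```

In this toy team only `drive` is private, because the atom `pkg_at_A` is shared between the crane's `place` and the truck's `load`.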

The distinction between private and public actions turns out to be an
important one. Since private actions neither depend on, nor are dependencies of,
actions performable by other members of the team, planning of sequences of
private actions can be implemented strictly locally by the agent the actions belong to.
In effect, the public actions become points of coordination among the
multi-agent team members. The algorithm MA-Plan for solving a planning problem
$\Pi$ can be thought of as two stages interleaved until a suitable multi-agent
plan is found: i) computation of a plan consisting exclusively of suitable
coordination points of the agent team, and subsequently ii) computation of
sequences of private actions filling the gaps between the public actions of
each individual agent. While the second stage can be computed in a local
manner by each individual agent without interactions with its peers, a truly
decentralized multi-agent algorithm for the first stage requires a non-trivial
amount of interaction between the agents.

One of the main contributions of Brafman and Domshlak's paper [6]
lies in the observation that the MA-Plan algorithm can be implemented by
reduction of the first stage to a constraint satisfaction problem (CSP) (cf. e.g.,
[9]). In the CSP, each agent is represented by a single variable ranging over
possible plans of the individual agent, and there are two types of constraints:

coordination constraint: a sequence of joint actions $\mathcal{P}$ (candidate
multi-agent plan) corresponding to a multi-agent planning problem
$\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ satisfies the coordination constraint iff for every action
$a = \mathcal{P}[k,i]$ performed by the agent $\alpha_i$ in the step $k$ we have that if $a$
is a public action, then

•  for every $p \in \mathit{pre}(a)$, there must exist $a_p = \mathcal{P}[k_p, i_p]$, such that
$p \in \mathit{add}(a_p)$ and $0 < k_p < k$ (there is some previous action which
causes $p$ to hold), or $p \in s_0$ in which case we set $k_p = 1$; and

•  for no $k'$ s.t. $k_p \leq k' \leq k$ there exists $a' = \mathcal{P}[k', i']$, such that
$p \in \mathit{del}(a')$ ($p$ won't be invalidated between causing it in the step
$k_p$ and execution of $a$ in the step $k$).

The constraint ensures that the dependencies of all the public actions
occurring in the overall multi-agent plan are satisfied, possibly by
actions performed in advance by other team members.

internal planning constraint: a sequence of joint actions $\mathcal{P}$ corresponding
to a multi-agent planning problem $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ satisfies the
internal planning constraint iff for every agent, the corresponding
single-agent planning problem with landmarks $\{a \mid a = \mathcal{P}[k,i] \in \mathit{Act}_{pub}\}$ is
solvable, meaning a single-agent planning algorithm is able to fill in
the gaps between the public actions in the candidate multi-agent plan.
The constraint ensures that each individual plan is locally executable
by the particular agent.

Algorithm 1 MA-Plan($\Pi$):

Input: A multi-agent planning problem $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$.
Output: A multi-agent plan $\mathcal{P}$ solving $\Pi$, if such exists.

$\delta = 1$
loop
    construct $CSP_{\Pi;\delta}$
    if solve-csp($CSP_{\Pi;\delta}$) then
        reconstruct a plan $\mathcal{P}$ from a solution for $CSP_{\Pi;\delta}$
        return $\mathcal{P}$
    else
        $\delta = \delta + 1$
    end if
end loop

Note, the formulation of the coordination constraint renders joint actions
with $\mathit{add}(a) \cap \mathit{del}(a) \neq \emptyset$ invalid. It is the non-strict inequalities in
condition 2 of the coordination constraint, together with the definition of
public actions, which ensure the local consistency of joint actions.

Algorithm 1 lists the original multi-agent planning algorithm MA-Plan
by Brafman and Domshlak [6]. The algorithm iterates through CSP
formulations of the planning problem according to $\delta$, informally the number of
coordination points between the agents in the multi-agent team. That means
$\delta$ determines the number of joint actions in a candidate multi-agent plan
that contain public actions. Filling the gaps between the individual single-agent
public actions, if possible, then gives rise to the overall multi-agent plan.
If such a plan completion does not exist, the process continues by
testing longer candidate plans, possibly not terminating in the case where no
solution to the given multi-agent planning problem exists.
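
The iterative deepening over $\delta$ can be sketched as a small driver loop. The sketch below is not the authors' implementation: the callback `solve_csp` is a hypothetical stand-in for constructing and solving the CSP for a given $\delta$ and reconstructing a plan, and the optional cap `max_delta` is an added assumption that forces termination on unsolvable inputs, which the original loop does not guarantee.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[Tuple[str, ...]]


def ma_plan(solve_csp: Callable[[int], Optional[Plan]],
            max_delta: Optional[int] = None) -> Optional[Plan]:
    """Try delta = 1, 2, ... until the CSP for that delta has a solution.

    solve_csp(delta) returns a plan reconstructed from a CSP solution,
    or None if the CSP for this delta is unsatisfiable.
    """
    delta = 1
    while max_delta is None or delta <= max_delta:
        plan = solve_csp(delta)
        if plan is not None:
            return plan
        delta += 1
    return None  # gave up after max_delta coordination points


def toy_solver(delta: int) -> Optional[Plan]:
    """Toy stand-in: pretend the problem needs three coordination points."""
    if delta < 3:
        return None
    return [("handover", "handover")] * delta


found = ma_plan(toy_solver)
gave_up = ma_plan(lambda delta: None, max_delta=5)
```

Without `max_delta`, the loop mirrors Algorithm 1 and runs forever when no solution exists.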

The original multi-agent planning algorithm assumes a centralized
planning architecture: it computes multi-agent plans centrally for a team of
agents, which are then supposed to execute them in a decentralized fashion.
Our motivation is, however, a decentralized planning and plan repair
algorithm followed by a decentralized plan execution.

In [10], Nissim, Brafman and Domshlak adapted the original blueprint
algorithm from [6] to a distributed setting. The adaptation rests on
formulating the multi-agent planning problem as a distributed constraint
satisfaction problem instance (DisCSP) and subsequently utilizing a state-of-the-art
DisCSP solver for solving it, plus managing the overhead involved in the
resulting distributed algorithm. From now on, whenever we speak about the
implementation of the multi-agent planning algorithm, we refer to its
decentralized version as described in [10].

As mentioned in previous sections, the multi-agent planner from Nissim,
Brafman and Domshlak [10] is built on a DisCSP solver and a centralized
single-agent heuristic search planner. The algorithm used as the DisCSP
solver is a customized Asynchronous Backtracking (ABT) solver [11] with
Asynchronous Forward Checking [12] heuristics. In the planner, the
best-first-search algorithm is employed with helpful action and landmark
heuristics. The planner is a part of the FastForward planning suite [13]. The
planning process passes through four separate phases: i) centralized preparation of
the DisCSP instance, ii) initialization of the solving process for the DisCSP
in a special agent representing the goal requirements, iii) decentralized
solving of the DisCSP problem and iv) decentralized finalization of the DisCSP
solving process.

In  the  final  decentralized  phase  the  agents  already  know  their  local  plans, 
given  a  multi-agent  solution  exists,  and  execute  them  in  a  distributed  manner. 

3.  Multi-agent  Plan  Repair 

Consider a multi-agent planning problem $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ and a plan
$\mathcal{P}$ solving $\Pi$. Furthermore, consider an environment in which, apart from
the actions performed by the agents of the team $\mathcal{A}$, no other exogenous
events occur. We say that such an environment is ideal, or non-dynamic.
The execution of $\mathcal{P}$ in such an environment is failure-free and is uniquely
determined by the sequence of states $s_0, \dots, s_m$, such that $s_{i+1} = s_i \oplus \mathcal{P}[i]$ (cf. also
Definition 2).

In dynamic environments, however, it can occur that in the course of
execution of $\mathcal{P}$, the environment interferes and the execution of some action
$\mathcal{P}[i]$ from the plan $\mathcal{P}$ does not result in precisely the state $s_{i+1}$ as defined
above. We could say that at step $i$ an unexpected event occurred in the
environment. For simplicity, we consider only unexpected events happening
exclusively in the course of execution of some action (as if it took a
non-zero time), not those which could occur while the agent is deliberating the
execution (as if the deliberation was instantaneous).

Note that not all unexpected events in dynamic environments necessarily
lead to problems with execution of the plan $\mathcal{P}$. However, there are at least
two cases of such events which can be considered a plan execution failure.

A weak failure of execution of the plan $\mathcal{P}$ at step $i$ w.r.t. the multi-agent
planning problem $\Pi$ occurs when the state $s_f$ resulting from an attempt to
perform the action $a = \mathcal{P}[i]$ does not satisfy some of the positive effects of
$a$, that is, $\mathit{add}(a) \not\subseteq s_f$.

A strong failure of execution of the plan $\mathcal{P}$ at step $i$ w.r.t. the planning
problem $\Pi$ occurs whenever the $i$-th action of $\mathcal{P}$ cannot be executed due
to its inapplicability. That is, the execution of the plan up to the step $i$
resulted in states $s_0, s_1, \dots, s_i$, possibly with some weak failures occurring in
the course of execution of the plan fragment, and $\mathcal{P}[i]$ is not applicable in $s_i$.
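
The two failure types can be checked with simple set inclusions. The sketch below assumes that a (joint) action is summarized by its precondition, add and delete sets and that applicability means the preconditions hold in the current state; the `Action` tuple and the door example are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def is_strong_failure(state: FrozenSet[str], action: Action) -> bool:
    """The action cannot be executed: some precondition is missing in the
    state reached before the step."""
    return not action.pre <= state


def is_weak_failure(observed: FrozenSet[str], action: Action) -> bool:
    """The action was attempted, but some positive effect is missing in the
    observed resulting state, i.e. add(a) is not a subset of s_f."""
    return not action.add <= observed


open_door = Action(pre=frozenset({"at_door"}),
                   add=frozenset({"door_open"}),
                   delete=frozenset())
```

A monitor that only watches for strong failures would call `is_strong_failure` before each step, whereas detecting weak failures requires observing the state after each step.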

The weak and the strong plan execution failures are, however, just two
examples of a plan failure. There certainly are application domains in which
weak failures can be tolerated as long as the goal state is reached after
execution of the multi-agent plan. In practice, it makes the most sense to monitor
for strong failures in the system's evolution. Most weak failures either lead to a
strong failure later on in the plan execution, or are irrelevant. The
exception is a weak failure which prevents the plan from reaching a
goal state, which happens when some atom supposed to be included in a
goal state fails to be effected by an action in the plan. There also might be
domains in which other types of plan execution failures can occur, e.g., any
change of the state not caused by the involved agents can be considered a
failure as well. Thus, monitoring for weak, strong, or even other types of plan
execution failures can strongly depend on the target application. To account
for the range of various types of failures, from now on, we only require that a
plan execution monitoring process determines some plan execution failure at
a step $i$ which results in some failed state $s_f$.

Definition 3 (multi-agent plan repair). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ be a
multi-agent planning problem. A multi-agent plan repair problem is a quadruple
$\Sigma = (\Pi, \mathcal{P}, s_f, k)$, where $\mathcal{P}$ is a multi-agent plan solving the planning
problem $\Pi$, $k$ is the step of $\mathcal{P}$ in which its execution failed and $s_f \in S$ is the
corresponding failed state.

A solution to the plan repair problem $\Sigma$ is a multi-agent plan $\mathcal{P}'$, such
that $\mathcal{P}'$ is a solution to the planning problem $\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$. We say
that $\mathcal{P}'$ repairs $\mathcal{P}$ in $s_f$. In the case $\mathit{Plans}(\Pi') = \emptyset$, we say that the plan is
irreparable given the failure occurring at the state $s_f$.

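Given two multi-agent plans $\mathcal{P}_1$ and $\mathcal{P}_2$ both repairing a multi-agent plan
$\mathcal{P}$ for a problem $\Pi$ in a state $s_f$, we say that $\mathcal{P}_1$ is preserving $\mathcal{P}$ more than
$\mathcal{P}_2$ iff $\mathit{diff}(\mathcal{P}_1, \mathcal{P}) < \mathit{diff}(\mathcal{P}_2, \mathcal{P})$ and denote the relation by $\mathcal{P}_1 < \mathcal{P}_2$. The
minimal repair of the multi-agent plan $\mathcal{P}$ is such a plan $\mathcal{P}_{min} \in \mathit{Plans}(\Pi')$
which minimizes the difference from the original plan among the plans solving $\Pi'$.
That is,

$$\mathcal{P}_{min} \in \arg\min_{\mathcal{P}' \in \mathit{Plans}(\Pi')} \mathit{diff}(\mathcal{P}', \mathcal{P})$$

Note, there might be several distinct minimal repairs of a given
multi-agent plan.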

In general, the multi-agent plan repair problem can be reduced to solving
a modified multi-agent planning problem and thus gives rise to a
straightforward plan repair algorithm based on replanning in two steps: i) construct
the multi-agent replanning problem $\Pi'$ as prescribed in Definition 3, and
subsequently ii) utilize the MA-Plan algorithm to solve the problem $\Pi'$.

While  the  notion  of  minimal  repair  of  multi-agent  plans  is  based  on  the 
number  of  changes  the  repaired  plan  contains  w.r.t.  the  original  plan,  also 
other  metrics  selecting  distinguished  plan  repairs  could  be  considered.  We 
will  discuss  examples  of  such  later  in  the  paper. 

The original motivation underlying this paper was the hypothesis that
attempts to repair failed multi-agent plans lead to lower communication
overhead than replanning. Clearly, not all planning problems could benefit from
such a mechanism. Since we focus on multi-agent planning problems, which
in a sense enforce coordination among the members of a multi-agent team,
we need to provide an indication of which planning problems tend to
benefit from the plan repairing approach. The following notion of coordination
frequency formalizes the idea.

Definition 4 (coordination frequency). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ be a
multi-agent planning problem with a solution $\mathcal{P}$. We say that $\mathcal{P}$ is $\delta$-coordinated
iff it contains at least $\delta$ coordination points, that is, joint actions including
at least one public action of some individual agent. In the case $\delta = 0$, that
is, $\mathcal{P}$ does not contain any public action, we say that $\mathcal{P}$ is uncoordinated.


Relative coordination frequency $\mathit{cf}(\mathcal{P})$ of a $\delta$-coordinated plan $\mathcal{P}$ denotes
the frequency of coordination point occurrence per single step in the plan
and is defined as

$$\mathit{cf}(\mathcal{P}) = \frac{\delta}{|\mathcal{P}|}$$

Relative coordination frequency $\mathit{cf}(\Pi)$ of a multi-agent planning problem
$\Pi$ denotes the minimal coordination frequency required to solve $\Pi$ and is
defined as

$$\mathit{cf}(\Pi) = \min_{\mathcal{P} \in \mathit{Plans}(\Pi)} \mathit{cf}(\mathcal{P})$$


The notion of relative coordination frequency of plans relates to the
fractional amount of coordination corresponding to a single step in a plan
execution. It straightforwardly extends to planning problems viewed as sets of
plans solving them: we simply look for solutions requiring the minimal relative
amount of coordination needed to solve the problem. The notion of relative
coordination frequency allows for comparison and ordering of multi-agent
planning problems according to the amount of coordination they minimally
require for solving them. Informally, we will call problems with relatively
low $\mathit{cf}(\Pi)$ loosely coordinated and those with $\mathit{cf}(\Pi)$ closer to 1 tightly
coordinated. Note that a problem with $\mathit{cf}(\Pi) = 0.5$ is still tightly coordinated, as
for each coordination step, there is only one uncoordinated step. Multi-agent
planning problems with $\mathit{cf}(\Pi) = 0$ will be called uncoordinated.
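
As a small worked illustration, the sketch below counts coordination points and computes the relative coordination frequency of a plan. Because $\mathit{Plans}(\Pi)$ is in general not finite, `cf_problem` takes a finite sample of solutions and therefore only yields an upper bound on $\mathit{cf}(\Pi)$; the function names and the toy plans are hypothetical, and an empty plan is given frequency 0 by convention.

```python
from typing import Iterable, List, Set, Tuple

Plan = List[Tuple[str, ...]]


def coordination_points(plan: Plan, public: Set[str]) -> int:
    """Number of joint actions containing at least one public action."""
    return sum(any(a in public for a in joint) for joint in plan)


def cf_plan(plan: Plan, public: Set[str]) -> float:
    """cf(P) = delta / |P|; an empty plan is treated as uncoordinated."""
    if not plan:
        return 0.0
    return coordination_points(plan, public) / len(plan)


def cf_problem(sample: Iterable[Plan], public: Set[str]) -> float:
    """Minimum cf over a finite sample of solutions (upper bound on cf(Pi))."""
    return min(cf_plan(p, public) for p in sample)


public = {"board", "pickup", "leave", "drop"}
walking = [("walk", "eps")] * 4                                # no public action
riding = [("board", "pickup"), ("eps", "drive"), ("leave", "drop")]
```

Here the riding plan has two coordination points in three steps, so its frequency is 2/3, while the longer walking plan is uncoordinated.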

Note, it still might be the case that even though a multi-agent planning
problem can be solved without any coordination, i.e., $\mathit{cf}(\Pi) = 0$, there can
exist coordinated plans in $\mathit{Plans}(\Pi)$ which are more efficient, e.g., shorter
than the uncoordinated ones. For instance, consider a domain where the
objective is that an agent A reaches a destination d. The agent A could
move from its starting position to d on its own, albeit slowly and resulting
in a relatively long plan. Alternatively, A could be transported quickly to
d by another agent B. The latter plan would be shorter in terms of overall
number of steps, but would require coordination. As a result, repair of such
a plan would be costlier in terms of the communication overhead it incurs,
our main concern here, than repair of the uncoordinated one.

The core hypothesis of the paper can be now stated more formally.

Hypothesis 1. Multi-agent plan repair approaches producing more
preserving repairs than replanning tend to generate lower communication overhead
for tightly coordinated multi-agent problems.

A crisper, though perhaps a more challenging version of the hypothesis
would express the communication overhead in terms of the average
communication complexity.

Hypothesis 2. When applied to tightly coordinated planning problems,
multi-agent plan repair algorithms producing more preserving repairs than
replanning should feature a lower average communication complexity than
replanning.

As shown in [6], $\delta$ turns out to play an important role in the time-complexity
analysis of the MA-Strips problem (cf. also Algorithm 1). Above, we
hypothesize that it is the relative frequency of coordination points along the
plans which plays a role in the communication complexity of
plan repair. Plan repair for problems which require coordination quite
often along the plans should lead to re-use of fragments including a relatively
large number of coordination points, which do not have to be planned for
again, and thus to a reduction of the communication required in the repair
process.

In the remainder of this paper, we address Hypothesis 1.
Treatment of Hypothesis 2 is beyond the scope of this paper and is left for
future work.

In the following subsections, we describe three plan repairing algorithms.
Sketches of ideas behind these algorithms were initially proposed in [14];
in the following we provide novel descriptions and formalizations of the
algorithms in full detail.

3.1.  Back-on-track  Repair

An unexpected event occurring in an environment can cause a failure in
execution of a plan performed by a multi-agent team in that environment.
The result would be that the overall state of the system won't be the one
expected by an undisturbed plan execution at the particular time step. A
straightforward idea to fix the problem is to utilize a multi-agent planner
to produce a plan from the failed state to the originally expected state and
subsequently follow the rest of the original multi-agent plan from the step in
which the failure occurred. The following multi-agent plan repair approach,
coined back-on-track (BoT) repair, is inspired by this idea, in fact a slight
generalization of it.

Algorithm 2 Back-on-Track-Repair($\Sigma$)

Input: A multi-agent plan repair problem $\Sigma = (\Pi, \mathcal{P}, s_f, k)$ with
$\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ and a sequence of states $s_0, \dots, s_m$ which a
failure-free execution of $\mathcal{P}$ would generate.
Output: A multi-agent plan $\mathcal{P}'$ solving $\Sigma$, if a solution exists.

construct $\Pi_{back} = (\mathcal{L}, \mathcal{A}, s_f, \{s_0, \dots, s_m\} \cup S_g)$
if MA-Plan($\Pi_{back}$) returns a solution $\mathcal{P}_{back}$ then
    retrieve the state $s_j$ of $\mathcal{P}$ to which $\mathcal{P}_{back}$ returns
    return $\mathcal{P}' = \mathcal{P}_{back} \cdot \mathcal{P}[j..\infty]$
else
    return $\mathcal{P}' = \bot$
end if

Definition 5 (back-on-track repair). Let $\Sigma = (\Pi, \mathcal{P}, s_f, k)$ be a multi-agent
plan repair problem and $\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$ be the corresponding modified
multi-agent replanning problem.

We say that a plan $\mathcal{P}' \in \mathit{Plans}(\Pi')$ is a back-on-track repair of $\mathcal{P}$ iff there
is a decomposition of $\mathcal{P}'$, such that $\mathcal{P}' = \mathcal{P}_{back} \cdot \mathcal{P}[i..\infty]$ for some $i < |\mathcal{P}|$.

$\mathcal{P}' = \mathcal{P}_{back} \cdot \mathcal{P}[i..\infty]$ is said to be a proper back-on-track repair iff
$|\mathcal{P}[i..\infty]| > 0$, i.e., $\mathcal{P}'$ preserves some non-empty suffix of $\mathcal{P}$.

Informally, the back-on-track approach tries to preserve a suffix of the
original plan and prefix it with a newly computed plan $\mathcal{P}_{back}$ starting in $s_f$ and
leading to some state along the execution of $\mathcal{P}$ in the ideal environment.
Note, all plans from $\mathit{Plans}(\Pi')$ are back-on-track repairs of the original plan.
The length of the preserved suffix of the original plan provides a handle
on ordering of the plans according to the quality of repair. The longer the
preserved suffix, the more preserving the plan is. On the other hand, even
when the plan repair problem $\Sigma$ is indeed solvable, there might not be any
valid proper back-on-track repair of the original plan.

Algorithm 2 realizes a multi-agent plan repair procedure according to the
back-on-track plan repair principle. Since the MA-Plan algorithm searches
for the shortest plan from the initial state to a goal state, the Back-on-Track-
Repair computes plans which return back to the original one in the shortest
possible way. The length of the overall repaired plan, however, depends
also on the selection of a particular goal state $s_g \in \{s_0, \dots, s_m\} \cup S_g$ of the
planning problem $\Pi_{back}$. If the planning algorithm selects $s_g$ according to an
ordering from $s_m$ to $s_0$ and later on the remaining states from $S_g$ for the same
lengths of possible $\mathcal{P}_{back}$ plans, the overall repaired resulting plan would also
be the shortest, under the condition that the result is a proper back-on-track repair.
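
The back-on-track principle can be sketched on a flattened, single-agent view of the problem. In the sketch below a plain breadth-first search stands in for MA-Plan, states are sets of atoms, and the goal is given as an explicit set of states; none of this reflects the distributed planner used in the paper, and the function names and the corridor example are hypothetical.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional, Set, Tuple

State = FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def apply(s: State, a: Action) -> State:
    return (s - a.delete) | a.add


def bfs_plan(start: State, goals: Set[State],
             actions: List[Action]) -> Optional[Tuple[List[Action], State]]:
    """Shortest action sequence from start to any goal state (stand-in planner)."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        s, path = queue.popleft()
        if s in goals:
            return path, s
        for a in actions:
            if a.pre <= s and apply(s, a) not in seen:
                seen.add(apply(s, a))
                queue.append((apply(s, a), path + [a]))
    return None


def back_on_track(plan, trace, s_f, goal_states, actions):
    """plan = a_1..a_m, trace = ideal states s_0..s_m; returns None for the
    undefined plan."""
    found = bfs_plan(s_f, set(trace) | set(goal_states), actions)
    if found is None:
        return None
    p_back, reached = found
    if reached in trace:
        j = max(i for i, s in enumerate(trace) if s == reached)
        return p_back + plan[j:]   # reuse the suffix of the original plan
    return p_back                  # reached a goal state directly


def move(a: int, b: int) -> Action:
    return Action(f"move{a}{b}", frozenset({f"at{a}"}),
                  frozenset({f"at{b}"}), frozenset({f"at{a}"}))


actions = [move(0, 1), move(1, 0), move(1, 2), move(2, 1), move(2, 3), move(3, 2)]
original = [move(1, 2), move(2, 3)]
trace = [frozenset({"at1"}), frozenset({"at2"}), frozenset({"at3"})]
repaired = back_on_track(original, trace, frozenset({"at0"}),
                         [frozenset({"at3"})], actions)
```

In the example the robot is displaced to position 0; the search finds the one-step plan back to the first state of the ideal trace and the whole original plan is reused as the suffix.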

The  algorithm  rests  on  invocation  of  the  underlying  multi-agent  planner, 
hence  its  correctness  relies  on  the  correctness  of  the  underlying  planner.  The 
following  lemma  states  the  soundness  of  Algorithm  2. 

Lemma 6 (Back-on-Track-Repair soundness). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ be a
multi-agent planning problem with agents situated in a dynamic environment
in which the environment can interfere with the plan execution and let $\mathcal{P}$
be a solution to $\Pi$. Let also $s_f$ be a state resulting from an interference
of the environment, a plan failure, at a step $k$ of execution of the plan $\mathcal{P}$.
$\Sigma = (\Pi, \mathcal{P}, s_f, k)$ denotes the corresponding multi-agent plan repair problem.

Unless the execution of Back-on-Track-Repair($\Sigma$) finishes with the
undefined plan $\bot$, a failure-free execution of the resulting plan $\mathcal{P}'$ leads to some
goal state of the original multi-agent planning problem $\Pi$.

Proof. Follows straightforwardly from the construction of $\Pi_{back}$ and the fact that $\mathcal{P}$
is a solution to $\Pi$. Either $\mathcal{P}_{back}$ leads to some state along the ideal execution
trace of the original plan $\mathcal{P}$ and then the remainder of $\mathcal{P}$ leading to the final
state $s_m \in S_g$ is reused, or a failure-free execution of $\mathcal{P}_{back}$ would lead directly
to some final state $s_{end} \in S_g$ without reusing a part of $\mathcal{P}$. □

Furthermore,  upon  a  failure  of  a  plan  execution,  if  there  exists  a  plan  from 
the  failed  state  to  a  final  state  of  the  original  multi-agent  planning  problem, 
the  back-on-track  algorithm  is  able  to  find  a  solution  to  the  corresponding 
multi-agent  plan  repair  problem. 

Lemma 7 (Back-on-Track-Repair completeness). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$, $\mathcal{P}$,
$s_f$, $k$ and consequently $\Sigma$ be as assumed in Lemma 6.

If there exists a solution to the modified multi-agent planning problem
$\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$, then the execution of the Back-on-Track-Repair($\Sigma$) algorithm
finishes and finds $\mathcal{P}' \neq \bot$, a solution repair of $\mathcal{P}$.

Proof. Again, follows straightforwardly from the construction of $\Pi_{back}$ in the
algorithm. Observe that if there is a solution plan to the problem
$\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$, then there also must exist at least the same solution to the
modified planning problem $\Pi_{back} = (\mathcal{L}, \mathcal{A}, s_f, \{s_0, \dots, s_m\} \cup S_g)$. That is,
in the worst case, the back-on-track approach resorts to re-planning from
scratch. □

Lemmas 6 and 7 establish how the back-on-track plan repair approach
inherits its correctness from the underlying multi-agent planner. Note,
however, that the algorithm is only partially complete, because in cases when there
is no solution to a given multi-agent planning problem, it is not ensured
that the algorithm MA-Plan terminates. Provided a totally complete
multi-agent planning algorithm directly replacing MA-Plan, total completeness of
the Back-on-Track-Repair algorithm could be straightforwardly established by
the lemmas above.

3.2.  Simple  Lazy  Repair 

The back-on-track multi-agent plan repair approach seeks to compute a
new prefix to some suffix of the original plan and repair the failure by their
concatenation. An alternative approach, coined lazy, attempts to preserve
the remainder of the original multi-agent plan and close the gap between the
state resulting from the failed plan execution and a goal state of the original
planning problem.

Let $s_f$ be the state resulting from a failure in execution of a multi-agent
plan $\mathcal{P}$ in a step $k$. We say that a sequence of joint actions $\mathcal{P}'$ is an executable
remainder of $\mathcal{P}$ from the step $k$ and the state $s_f$ iff there exists a sequence
of states $s_k, \dots, s_m$, such that $s_k = s_f$, $s_{i+1} = s_i \oplus \mathcal{P}'[i - k + 1]$ and for every
step $i$ and every agent $j$, we have that $\mathcal{P}'[i - k + 1, j] = \mathcal{P}[i,j]$ in the case
$\mathcal{P}[i,j]$ is applicable in the state $s_i$ and $\mathcal{P}'[i - k + 1, j] = \epsilon$ otherwise. The
following definition provides a formal definition of the lazy approach.
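
The construction of an executable remainder can be sketched as a forward simulation. The sketch assumes that a primitive action is applicable iff its preconditions hold in the current state, and that a joint action is applied by removing all delete effects and then adding all add effects; this reading of $\oplus$, as well as the `Action` tuple and the two-agent example, are assumptions made for illustration.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

State = FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str] = frozenset()
    add: FrozenSet[str] = frozenset()
    delete: FrozenSet[str] = frozenset()


EPS = Action("eps")  # the empty action, always applicable


def apply_joint(s: State, joint: Tuple[Action, ...]) -> State:
    """Assumed semantics of s (+) a: delete effects first, then add effects."""
    dels = frozenset().union(*(a.delete for a in joint))
    adds = frozenset().union(*(a.add for a in joint))
    return (s - dels) | adds


def executable_remainder(plan: List[Tuple[Action, ...]], k: int, s_f: State):
    """Keep P[i, j] where applicable, otherwise substitute the empty action.
    Returns the remainder from step k (1-indexed) and the final state."""
    state, remainder = s_f, []
    for joint in plan[k - 1:]:
        kept = tuple(a if a.pre <= state else EPS for a in joint)
        remainder.append(kept)
        state = apply_joint(state, kept)
    return remainder, state


pick = Action("pick", frozenset({"pkg_here"}), frozenset({"holding"}),
              frozenset({"pkg_here"}))
carry = Action("carry", frozenset({"holding", "at_a"}), frozenset({"at_b"}),
               frozenset({"at_a"}))
drop = Action("drop", frozenset({"holding", "at_b"}), frozenset({"pkg_at_b"}),
              frozenset({"holding"}))
open_door = Action("open", frozenset(), frozenset({"door_open"}))

plan = [(pick, EPS), (carry, open_door), (drop, EPS)]
# Failure in step 2: the package was lost, so "holding" does not hold.
remainder, s_lazy = executable_remainder(plan, 2, frozenset({"at_a"}))
```

In the example only the second agent's `open` action survives; the first agent's `carry` and `drop` are replaced by the empty action because their preconditions no longer hold.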

Definition 8 (simple lazy repair). Let $\Sigma = (\Pi, \mathcal{P}, s_f, k)$ be a multi-agent
plan repair problem and $\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$ be the corresponding modified
multi-agent replanning problem.

We say that a plan $\mathcal{P}' \in \mathit{Plans}(\Pi')$ is a lazy repair of $\mathcal{P}$ iff there is
a decomposition of $\mathcal{P}'$, such that $\mathcal{P}' = \overline{\mathcal{P}}[k..\infty] \cdot \mathcal{P}_{lazy}$, where $\overline{\mathcal{P}}[k..\infty]$ is the
executable remainder of $\mathcal{P}$ from the step $k$, execution of which, starting from
$s_f$, results in the state $s_{lazy}$, and $\mathcal{P}_{lazy}$ is a solution to the multi-agent planning
problem $\Pi_{lazy} = (\mathcal{L}, \mathcal{A}, s_{lazy}, S_g)$.

Algorithm 3 Lazy-Repair($\Sigma$)

Input: A multi-agent plan repair problem $\Sigma = (\Pi, \mathcal{P}, s_f, k)$, with
$\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$.
Output: A multi-agent plan $\mathcal{P}'$ solving the problem $\Sigma$, if a solution exists.

construct $\overline{\mathcal{P}}[k..\infty]$, the executable remainder of $\mathcal{P}[k..\infty]$ from the state $s_f$
simulate execution of $\overline{\mathcal{P}}[k..\infty]$ from $s_f$ on, resulting in a final state $s_{lazy}$
construct $\Pi_{lazy} = (\mathcal{L}, \mathcal{A}, s_{lazy}, S_g)$
$\mathcal{P}_{lazy}$ = MA-Plan($\Pi_{lazy}$)
return $\overline{\mathcal{P}}[k..\infty] \cdot \mathcal{P}_{lazy}$, unless $\mathcal{P}_{lazy} = \bot$, in which case return $\bot$


Algorithm  3  realizes  multi-agent  plan  repair  based  on  the  lazy  repair 
approach  described  above. 

Similarly to the back-on-track algorithm, Algorithm 3 inherits its
correctness from the underlying multi-agent planner invoked internally.

Lemma 9 (Lazy-Repair soundness). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$, $\mathcal{P}$, $s_f$, $k$ and $\Sigma$
be as assumed in Lemma 6.

Unless the execution of Lazy-Repair($\Sigma$) finishes with the undefined plan
$\bot$, a failure-free execution of the resulting plan $\mathcal{P}'$ leads to some goal state of
the original multi-agent planning problem $\Pi$.

Proof. In whichever state $s_{lazy}$ a failure-free execution of the executable
remainder of $\mathcal{P}$ ends up, the solution plan to the problem $\Pi_{lazy}$, if it exists, will
take the system from there to some final state corresponding to the original
multi-agent planning problem $\Pi$. The executable remainder of $\mathcal{P}$ from the
state in which the failure occurred will get reused in the resulting plan. □

Unlike the back-on-track algorithm, the lazy approach is in general
incomplete, as it might happen that the execution of the executable remainder of
the original plan diverges to a state from which no plan to a goal state exists.
The notion of the algorithm completeness has to be weakened to domains in
which the agent team is at least capable of reverting its own actions.

Definition 10 (connected multi-agent planning domain). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$
be a multi-agent planning problem. Let also $Act = \alpha_1 \times \cdots \times \alpha_n$, with
$\alpha_1, \ldots, \alpha_n \in \mathcal{A}$, and $S = 2^{\mathcal{L}}$. We say that the planning problem induces
a connected planning domain iff for every state $s \in S$ and every joint action
$a \in Act$, there exists a solution to the multi-agent planning problem
$\Pi' = (\mathcal{L}, \mathcal{A}, s \oplus a, s)$, i.e., a plan $\mathcal{P} = a_1, \ldots, a_k$ such that
$s = s \oplus a \oplus a_1 \oplus \cdots \oplus a_k$.

In essence, the definition of a connected multi-agent planning domain states
that it is within the capabilities of the multi-agent team $\mathcal{A}$ to “undo”,
or “revert”, the effects of any of its own actions. Note that a single-agent
version of the definition (with an omnipotent agent $\alpha = \bigcup_{\alpha_i \in \mathcal{A}} \alpha_i$) would also
suffice: since we require that $\epsilon \in \alpha$ for every $\alpha \in \mathcal{A}$, any joint
action of the team can be transformed into a corresponding multi-agent plan
of length $n$ with only a single agent acting in any given step of the plan.
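
To make Definition 10 concrete, the following Python sketch checks connectedness by brute force on a toy transition system. The encoding is an assumption made purely for illustration: states are plain integers, `actions` maps a hypothetical action name to a partial transition function (returning `None` where the action is not applicable), and each joint action is treated as a single transition. The check is restricted to applicable actions.

```python
from collections import deque

def reachable(start, goal, actions):
    """Breadth-first search: can `goal` be reached from `start`?"""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for step in actions.values():
            nxt = step(state)  # None means the action is not applicable
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def is_connected(states, actions):
    """True iff the effect of every applicable action can be undone."""
    for s in states:
        for step in actions.values():
            t = step(s)
            if t is not None and not reachable(t, s, actions):
                return False
    return True

# Hypothetical toy domain: a robot walking on positions 0..2.
walk = {
    "right": lambda s: s + 1 if 0 <= s < 2 else None,
    "left": lambda s: s - 1 if s > 0 else None,
}
# The same domain with an irreversible jump from position 2 into state -1.
cliff = dict(walk, jump=lambda s: -1 if s == 2 else None)

print(is_connected([0, 1, 2], walk))        # True
print(is_connected([0, 1, 2, -1], cliff))   # False
```

The second domain is not connected because nothing leads back from state -1, which is the situation in which the lazy approach loses completeness.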

The  following  lemma  states  that  the  Lazy-Repair  algorithm  is  complete  in 
connected  planning  domains. 

Lemma 11 (Lazy-Repair completeness). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ induce a
connected multi-agent planning domain and let $\mathcal{P}$, $s_f$, $k$, as well as
$\Sigma$, be as in Lemma 6. Let also $s_{lazy}$ be the state to which
a failure-free execution of the executable remainder of $\mathcal{P}[k..\infty]$ would
lead.

If there exists a solution plan $\mathcal{P}'$ to the multi-agent planning problem
$\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$, then the execution of the Lazy-Repair($\Sigma$) algorithm
finishes and finds a plan $\mathcal{P}^*$, other than the undefined plan, that is a
solution repair of $\mathcal{P}$.

Proof. Let $\overline{\mathcal{P}} = a_{k+1}, \ldots, a_m$ denote the executable remainder of
$\mathcal{P}[k..\infty]$ and let $s_{k+1}, \ldots, s_m$ be the states resulting from a failure-free
execution of $\overline{\mathcal{P}}$, i.e., $s_{j+1} = s_j \oplus a_j$ for $k+1 \le j < m$. Since the agent
team acts in a connected planning domain, any of its actions is reversible,
that is, its effects can be undone. Therefore, for the execution of each
action above, there must exist a plan, written here as $\mathcal{P}^{rev}_j$, that is a
solution to the planning problem $\Pi^{rev}_j = (\mathcal{L}, \mathcal{A}, s_{j+1}, s_j)$. Since we assume
that there exists a plan $\mathcal{P}'$ solving the problem $\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$, the plan

$$\mathcal{P}^* = \overline{\mathcal{P}} \cdot \mathcal{P}^{rev}_{m-1} \cdots \mathcal{P}^{rev}_{k+1} \cdot \mathcal{P}'$$

is a solution for the plan repairing problem $\Sigma = (\Pi, \mathcal{P}, s_f, k)$.
That is, the solution plan first executes the executable remainder of
the original plan $\mathcal{P}$ from the point of failure (as defined by the algorithm),
then “undoes/reverts” the effects of all the actions performed in it and thus
returns to the state $s_f$, and finally executes the plan $\mathcal{P}'$, the existence of
which we assume. □

A corollary of the line of reasoning leading to the proof of completeness
of the lazy repair approach is that, even in domains where the environment
causes no irreversible interference, the actions of the agent team itself
can break the system evolution beyond repair. For illustration, even though
the physical environment may not be able to push a robot over a cliff, the
robot is perfectly able to jump off it on its own while executing the
executable remainder of some otherwise harmless plan that failed shortly
before. In such domains, the lazy approach has to be employed with caution.

To conclude, similarly to the back-on-track approach, Lemma 11 states
only partial completeness of the Lazy-Repair algorithm if the underlying
multi-agent planner does not ensure termination.

3.3.  Repeated  Lazy  Repair 

In a dynamic environment, plan failures occur repeatedly. Even after a
repair of a failed plan, the repaired plan can fail again. In this
situation both the back-on-track and the lazy multi-agent plan repair
algorithms lengthen the actually executed plan. In the case of the
back-on-track approach, this is inevitable, since upon the repair, the
subsequent plan execution process immediately processes the newly added
plan fragment. In the case of the lazy repair, however, when another
failure occurs during execution of an already repaired plan, it is not
always necessary to lengthen the overall multi-agent plan. If a second
failure occurs while the agents are still executing the fragment of the
original plan preserved by the first repair, the suffix appended by the
first repair can be discarded and, should it be necessary, replaced by a
new plan suffix repairing the second failure.

The following definition formally introduces repeated lazy (RLazy) plan
repair, an extension of the lazy multi-agent plan repair approach introduced
in Definition 8. For clarity, from now on, we refer to the lazy multi-agent
plan repair introduced in the previous subsection as simple lazy repair.

Definition 12 (repeated lazy repair). Let $\Sigma = (\Pi, \mathcal{P}, s_f, k)$ be a multi-agent
plan repairing problem. Let also $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ be the corresponding
multi-agent planning problem with a solution of the form $\mathcal{P} = \mathcal{P}' \cdot \mathcal{P}_{fix}$. In
the case this is the first failure encountered during execution of $\mathcal{P}$, we have
$|\mathcal{P}_{fix}| = 0$ and thus $\mathcal{P} = \mathcal{P}'$. Otherwise, $\mathcal{P}$ is a simple lazy repair solution of
some (previously solved) plan repair problem $\Sigma_p = (\Pi, \mathcal{P}_p, s_{f_p}, k_p)$, composed
of an executable remainder of $\mathcal{P}_p$ (represented as $\mathcal{P}'$) and a repair suffix $\mathcal{P}_{fix}$.

We say that $\mathcal{P}''$ is a repeated lazy repair of $\mathcal{P}$ iff

1. $\mathcal{P}''$ is a simple lazy repair solution to $\Sigma' = (\Pi, \mathcal{P}', s_{f_p}, k)$ in the case
   $k < |\mathcal{P}'|$ (the failure occurred still within the executable remainder
   of $\mathcal{P}_p[k_p..\infty]$); or otherwise

2. $\mathcal{P}''$ is a simple lazy repair solution to $\Sigma' = (\Pi, \mathcal{P}, s_{f_p}, k)$.

The repeated lazy repair leads to a straightforward extension of the simple
lazy plan repair algorithm listed in Algorithm 3. The intuitive benefit of
the repeated lazy repair approach is that it should lead to shorter executed
plans than the simple lazy repair would. Consider a plan execution failure
at step $k_1$ of a plan $\mathcal{P}$. The simple lazy repair approach would fix it by
appending a suffix $\mathcal{P}_1$ to the executable remainder of $\mathcal{P}[k_1..\infty]$. A simple
lazy repair of a second failure at a step $k_2$, occurring still somewhere in
that preserved fragment, would result in a solution composed of the
executable remainder of $\mathcal{P}[k_2..\infty]$ followed by $\mathcal{P}_1 \cdot \mathcal{P}_2$, where the suffix $\mathcal{P}_2$
is the solution to the second plan repair problem. In contrast, upon
occurrence of the second failure the repeated lazy repair discards the
previously computed suffix $\mathcal{P}_1$ and replaces it with a new suffix $\mathcal{P}_2$,
resulting in a repair solution composed of the executable remainder of
$\mathcal{P}[k_2..\infty]$ followed by $\mathcal{P}_2$ alone. The idea is that in many domains the new
suffix should be shorter than the combined suffix $\mathcal{P}_1 \cdot \mathcal{P}_2$. This could be
especially beneficial in domains in which subsequent failures can even
revert, or otherwise fix, the ones occurring previously.

Algorithm 4 Repeated-Lazy-Repair($\Sigma$)

Input: A multi-agent plan repairing problem $\Sigma = (\Pi, \mathcal{P}, s_f, k)$ with
$\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ and its solution $\mathcal{P}$. In the case $\mathcal{P}$ is a lazy repair
solution of a (previously solved) plan repair problem $\Sigma_p = (\Pi, \mathcal{P}_p, s_{f_p}, k_p)$,
it takes the form $\mathcal{P} = \mathcal{P}' \cdot \mathcal{P}_{fix}$. Otherwise, in the case this is the first
failure encountered, $|\mathcal{P}_{fix}| = 0$.

Output: A multi-agent plan solving $\Sigma = (\Pi, \mathcal{P}, s_f, k)$.

if $k < |\mathcal{P}'|$ then
    return Lazy-Repair($(\Pi, \mathcal{P}', s_f, k)$)
else
    return Lazy-Repair($(\Pi, \mathcal{P}, s_f, k)$)
end if
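
The dispatch in Algorithm 4 can be sketched in Python on a toy domain. Everything below is an illustrative assumption rather than the MA-Plan machinery: plans are lists of action names, `apply` and `planner` are hypothetical stand-ins for the state transition and for the underlying planner, indices are 0-based, and the executable remainder is read as "keep every action that is still applicable in sequence".

```python
def executable_remainder(plan, state, apply):
    """Keep the actions that can still be applied in sequence (assumed reading)."""
    kept = []
    for action in plan:
        nxt = apply(state, action)
        if nxt is not None:  # inapplicable actions are skipped
            kept.append(action)
            state = nxt
    return kept, state

def lazy_repair(plan, s_f, k, apply, planner):
    """Simple lazy repair: reuse what still runs, then plan a new suffix."""
    kept, s_lazy = executable_remainder(plan[k:], s_f, apply)
    suffix = planner(s_lazy)
    if suffix is None:  # stands for the undefined plan
        return None
    return kept, suffix

def repeated_lazy_repair(preserved, fix, s_f, k, apply, planner):
    """Drop the old suffix `fix` if the failure hit the preserved fragment."""
    if k < len(preserved):
        return lazy_repair(preserved, s_f, k, apply, planner)
    return lazy_repair(preserved + fix, s_f, k, apply, planner)

# Hypothetical toy domain: an integer counter with goal value 3.
def apply(state, action):
    return state + 1 if action == "inc" else state - 1

def planner(state):
    return ["inc"] * (3 - state) if state <= 3 else ["dec"] * (state - 3)

preserved, fix = ["inc", "inc"], ["inc"]  # result of a first repair
# Second failure at step index 1: a perturbation has set the counter to 3.
print(repeated_lazy_repair(preserved, fix, 3, 1, apply, planner))
# (['inc'], ['dec'])
print(lazy_repair(preserved + fix, 3, 1, apply, planner))
# (['inc', 'inc'], ['dec', 'dec'])
```

In this toy run the repeated variant yields a plan of two actions, whereas simple lazy repair of the whole plan keeps the stale suffix and needs four.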


As with the previous two plan repairing approaches, we conclude the
discourse with proofs of correctness of the repeated lazy repair approach.

Lemma 13 (Repeated-Lazy-Repair soundness). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$, $\mathcal{P}$, $s_f$,
$k$ and $\Sigma$ be as assumed in Lemma 6.

Unless the execution of Repeated-Lazy-Repair($\Sigma$) finishes with the
undefined plan, a failure-free execution of the resulting plan leads to some
goal state of the original multi-agent planning problem $\Pi$.

Proof. Follows immediately from the soundness of the simple lazy repair
approach in Lemma 9. □

Lemma 14 (Repeated-Lazy-Repair completeness). Let $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$ induce
a connected multi-agent planning domain and let $\mathcal{P}$, $s_f$, $k$, as well as
$\Sigma$, be as in Lemma 6.

If there exists a solution plan to the multi-agent planning problem
$\Pi' = (\mathcal{L}, \mathcal{A}, s_f, S_g)$, then the execution of the Repeated-Lazy-Repair($\Sigma$)
algorithm finishes and finds a plan $\mathcal{P}'$, other than the undefined plan, that
is a solution repair of $\mathcal{P}$.

Proof. Again, this follows straightforwardly from the proof of completeness
of the simple lazy repair algorithm. Note that the proof of Lemma 11 does
not depend on what exactly the final state reached by the executable
remainder of the original plan looks like; it can be arbitrary. Therefore,
when we arbitrarily modify the executable remainder of the original plan, as
in Algorithm 4, the proof still holds. That is, if there exists a plan $\mathcal{P}''$
from $s_f$ to some state in $S_g$, then in connected domains there must exist at
least the plan that first executes the executable remainder of the original
plan, subsequently a plan reverting its effects back to $s_f$, and then finally
performs the steps of $\mathcal{P}''$. □

4.  Multi-agent  Plan  Repairing  Process 

The  multi-agent  planning,  executing,  monitoring  and  repairing  process 
has  two  phases.  In  the  first  phase,  for  a  given  domain  a  multi-agent  plan  is 
constructed  using  the  MA-Plan  algorithm.  In  the  second  phase,  the  plan  is 
executed  by  the  agents  acting  in  a  shared  environment.  In  the  course  of  the 
execution,  the  dynamics  of  the  environment  can  interfere,  possibly  resulting 
in  a  failure  of  the  executed  plan.  Since  the  plan  execution  is  monitored  by  the 
multi-agent  team,  or  a  centralized  observer,  upon  a  failure  detection  a  plan 
repair  algorithm  is  invoked.  In  turn,  to  find  particular  repairing  plans,  the 
MA-Plan  algorithm  is  invoked  as  specified  by  the  plan  repairing  algorithms 
introduced  in  the  previous  section. 

The scheme listed in Algorithm 5 shows the pseudo-code of the process. Since
we assume complete information, there is no difference between decentralized
and centralized monitoring; hence, for clarity, the algorithm instantiates
the centralized version. As a consequence of the information completeness
assumption, the centralized initialization of the MA-Plan algorithm also
does not negatively affect the amount of communication in the system.

Algorithm 5 Plan execution and monitoring scheme.

Input: An initial multi-agent planning problem $\Pi = (\mathcal{L}, \mathcal{A}, s_0, S_g)$.

$\mathcal{P}$ = MA-Plan($\Pi$)
if $\mathcal{P}$ is the undefined plan then return fail
$k = 1$
repeat
    agents perform $\mathcal{P}[k]$
    if failure detected then
        retrieve the current state $s$ from the environment
        $\mathcal{P}$ = Repair($(\Pi, \mathcal{P}, s, k)$)
        $k = 1$
    else
        $k = k + 1$
    end if
until $\mathcal{P}$ is the undefined plan or $k > |\mathcal{P}|$
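
The execution and monitoring scheme of Algorithm 5 can be sketched as a small Python loop. This is a minimal illustration under stated assumptions, not the simulator used in the experiments: `perform`, `repair`, the `CounterWorld` environment and the `max_repairs` guard are all hypothetical, indices are 0-based, and the stand-in repair function simply replans to the goal.

```python
def execute_with_repair(plan, perform, repair, max_repairs=100):
    """Execute `plan` step by step and repair it whenever a step fails.

    perform(action) -> (failed, state) executes one step in the environment.
    repair(plan, state, k) -> a new plan, or None for the undefined plan.
    Returns the list of attempted actions, or None if no plan was found.
    """
    if plan is None:
        return None
    k, executed, repairs = 0, [], 0
    while k < len(plan):
        failed, state = perform(plan[k])
        executed.append(plan[k])
        if failed:
            repairs += 1
            if repairs > max_repairs:  # added guard: the scheme may not terminate
                return None
            plan = repair(plan, state, k)
            if plan is None:
                return None
            k = 0
        else:
            k += 1
    return executed

class CounterWorld:
    """Toy environment: a counter whose action silently fails at given ticks."""
    def __init__(self, failing_ticks):
        self.state, self.clock, self.failing = 0, 0, set(failing_ticks)

    def perform(self, action):
        self.clock += 1
        if self.clock in self.failing:
            return True, self.state  # action failure: nothing was executed
        self.state += 1 if action == "inc" else -1
        return False, self.state

def replan(plan, state, k):
    """Stand-in for Repair: plan from the current state to the goal value 3."""
    return ["inc"] * (3 - state)

world = CounterWorld(failing_ticks={2})
trace = execute_with_repair(["inc"] * 3, world.perform, replan)
print(len(trace), world.state)  # 4 3
```

The single injected failure costs one extra attempted step, so four joint actions are attempted before the goal value is reached.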

Before execution of each plan step, the algorithm checks whether a failure
occurred and, if so, invokes a plan repair algorithm. We do not explicitly
articulate what a failure amounts to, since this can be application specific.
Plausible options include checking for weak or strong failures, i.e., the
validity of the effects of the previously executed action, or the validity
of the preconditions of the action to be executed next. Alternatively, in
some applications it might be useful to check for any exogenous change of
the current state not caused by the involved agents.

Finally, the algorithm accounts for the possibility that the plan repairing
process finds no solution to the failure. If that is the case, the
algorithm finishes with the final plan equal to the undefined plan. Note,
however, that Algorithm 5 does not necessarily terminate. Termination of
the scheme relies on two factors. Firstly, it depends on the termination
property of the underlying multi-agent planner invoked by the plan repair
algorithms discussed in Section 3. Secondly, unless no repair of the
occurred failure can be found, the algorithm terminates when it is able to
fully execute the computed plan. In environments where failures occur
relatively frequently, it can however happen that the plan execution,
monitoring and repair process keeps repairing recurring failures before the
previous repair has been fully executed. As a consequence, the executed
plan would be gradually prolonged so that it never reaches the end of its
execution. Informally, for such domains, we could state that Algorithm 5
terminates when the plan repair process generates repaired plans strictly
shorter than the time horizon in which the failures in the environment tend
to occur. Results in [15] discuss steps towards a formal analysis of such a
planning horizon and a classification of various planning domains with
respect to the frequency of failures occurring in an environment and the
likelihood that an agent completes its plans without an interruption.

Instantiation of the execution, monitoring and repair scheme with the
repeated lazy repair algorithm allows for an alternative plan execution
model. The invocation of the planning process in the repair algorithm could
be delayed until the execution of the preserved fragment of the original
plan finishes. Such an approach could preserve significantly longer
fragments of the original plan than instantiation of the original scheme in
Algorithm 5 with the Repeated-Lazy-Repair algorithm. That is, upon a
failure, instead of trying to repair the failed plan right away, as both the
back-on-track and simple lazy plan repair algorithms invoked from the listed
plan execution scheme would do, the system can simply proceed with execution
of the remainder of the original plan and trigger the lazy plan repair only
after it finishes. The approach simply ignores the plan failures during
execution and postpones the repair to the very end of the process, hence the
“lazy” label for the two algorithms. In some domains, such an approach could
significantly decrease the number of multi-agent planner invocations and
consequently save a large amount of communication overhead.

5.  Experimental  Validation 

To verify Hypothesis 1, we conducted a series of experiments with
implementations of the multi-agent plan repair algorithms described in the
previous section. Below, we first describe the experimental setup, then
interpret the collected data, and finally revisit Hypothesis 1.

5.1. Experimental Setup

The experiments were based on the presented two-stage plan repairing
algorithm, where we distinguish two types of plan failures: action failures
and state perturbations. Both failure types are parametrized by a
probability P that determines, by uniform random sampling, whether a
simulation step fails or not (a failure is generated only if there still
exists a plan to a goal, which obviates problems with irreversible actions).
Both failure types are weak failures. That is, they are not handled
immediately, but they can prevent further plan execution and later result in
a strong failure. Upon detection, a strong failure is handled by one of the
plan repairing algorithms.

An action failure is simulated by not executing one of the individual
agent actions from the current plan step. The individual action is chosen
according to a uniform probability distribution over the positions within
the joint action. The failed individual action is then removed from the
joint action and the current state is updated by the modified joint action.

The other simulated failure type, state perturbation, is parametrized by
a positive integer c, which determines the number of state terms that are
removed from the current state, as well as the number of terms that are
added to it. The terms to be added or removed are also selected randomly
from the domain language according to a uniform distribution.
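
The two failure generators described above can be sketched in Python as follows. The representation is an assumption for illustration only: a joint action is a list of individual actions, a state is a set of string terms, and the function names are hypothetical. The sketch draws removed terms from the current state, and it omits the check that a plan to a goal still exists.

```python
import random

def action_failure(joint_action, p, rng):
    """With probability p, drop one uniformly chosen individual action."""
    if joint_action and rng.random() < p:
        i = rng.randrange(len(joint_action))
        return joint_action[:i] + joint_action[i + 1:]
    return list(joint_action)

def state_perturbation(state, language, p, c, rng):
    """With probability p, remove c terms from the state and add c terms."""
    perturbed = set(state)
    if rng.random() >= p:
        return perturbed
    removable = sorted(perturbed)
    for term in rng.sample(removable, min(c, len(removable))):
        perturbed.discard(term)
    addable = sorted(set(language) - perturbed)
    for term in rng.sample(addable, min(c, len(addable))):
        perturbed.add(term)
    return perturbed

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(action_failure(["m(d1,a1)", "f(a1,a2)", "m(d2,a2)"], 1.0, rng))
print(state_perturbation({"at(p,d1)", "at(T1,d1)"},
                         {"at(p,d1)", "at(p,a1)", "at(T1,d1)", "at(T1,a1)"},
                         1.0, 1, rng))
```

Passing an explicit `random.Random` instance keeps the failure stream separate from any other randomness in a simulator, which makes individual runs reproducible.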

We implemented the experimental setup as a centralized simulator of the
environment integrating the multi-agent domain-independent planner MA-Plan.
The individual agents are initialized by the planner initialization process,
together with a given planning problem instance. Each agent runs in its own
thread and the agents deliberate asynchronously. The agents also send their
peer-to-peer messages via the centralized simulator. The messages are sent
by the integrated MA-Plan planner exclusively in the DisCSP phase.

The experiments were performed on an FX-8150 8-core processor at 3.6 GHz
with the Java Virtual Machine limited to 2.5 GB of RAM. The individual
measurements were parametrized by the plan failure probability P and each
problem instance was executed 10 times with different sampled values. The
resulting data are presented in the figures with their natural distribution.
The candlestick charts depict the differences between the minimal and the
maximal measurements, together with the standard deviation. The accompanying
charts represent the percentage ratio between the measured variable for the
particular repairing method and for replanning from scratch (normalized at
the 100% level). The values presented in the result table are averages over
the measurements with the same parametrization.
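
The normalization against replanning can be sketched as follows. The function names are hypothetical, the sample values are made up solely to exercise the code (they are not measurements from the experiments), and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def summarize(samples):
    """Candlestick-style summary of repeated measurements."""
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
        "std": pstdev(samples),  # assumed estimator
    }

def relative_to_replanning(repair_samples, replan_samples):
    """Mean of a repair method as a percentage of the replanning mean."""
    return 100.0 * mean(repair_samples) / mean(replan_samples)

# Made-up message counts for one parametrization, for illustration only.
repair = [20, 30, 25, 25]
replan = [90, 110, 100, 100]
print(summarize(repair)["mean"])               # 25
print(relative_to_replanning(repair, replan))  # 25.0
```

A value below 100 means the repair method needed less of the measured resource than replanning from scratch.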

Since we are using the planner MA-Plan as a black-box algorithm, the
proportion relative to the replanning approach bears a higher significance
than the particular absolute numbers. Moreover, the general scalability of
the proposed plan repairing algorithms stands or falls with the planner
used. Therefore, improvements in the scalability of the presented plan
repairing techniques should correspond to improvements of the underlying
domain-independent multi-agent planner. As we stated before, we used a
current state-of-the-art planner. Improving such a planner towards better
scalability is an interesting open research question at this point.

5.2.  Test  Problems,  Algorithms  and  Metrics 

The experiments were conducted on four planning domains. Three of the
domains originate in the standard single-agent IPC planning benchmarks [16].
Similarly to the evaluation of the MA-Plan implementation in [10], we chose
domains that are straightforwardly modifiable to the multi-agent setting:
LOGISTICS (2-6 agents), ROVERS (2-4 agents), and SATELLITES (2-6 agents).
Additionally, we have extended the set of IPC-based domains by a well-known
coordination domain, COOPERATIVE PATHFINDING (2-4 agents).

Instances of the LOGISTICS problems are about transporting packages between
locations by a fleet of heterogeneous transport vehicles. A representative
example of a LOGISTICS problem, $\Pi^{log3}$, used as one of the experiments,
contains three agents controlling two trucks T1 and T2 and one airplane A.
There are two cities, each with one storage depot (d1 and d2) and one
airport (a1 and a2). The trucks can move, m(from, to), only within their
cities, between one depot and one airport. The airplane can fly, f(from, to),
among all airports in the environment, but cannot land at the depots. All
vehicles can load, l(package, location), and unload, u(package, location), a
package at a location. Initially, there is one package p at one of the depots
and the goal is to transport it to the other depot in the other city. The
trucks start at the depots and the airplane starts at one of the airports.
A multi-agent plan solving this particular instance is

$$
\mathcal{P}^{log3} = \left(
\begin{array}{llllllllll}
A:  & \epsilon & \epsilon & \epsilon & l(p,a1) & f(a1,a2) & u(p,a2) & \epsilon & \epsilon & \epsilon \\
T1: & l(p,d1) & m(d1,a1) & u(p,a1) & \epsilon & \epsilon & \epsilon & \epsilon & \epsilon & \epsilon \\
T2: & m(d2,a2) & \epsilon & \epsilon & \epsilon & \epsilon & \epsilon & l(p,a2) & m(a2,d2) & u(p,d2)
\end{array}
\right)
$$

The coordination frequency for this problem is $cf(\Pi^{log3}) = \frac{4}{9} \approx 0.44$,
because the plan contains 4 coordination points, its length is
$|\mathcal{P}^{log3}| = 9$, and the presented plan $\mathcal{P}^{log3}$ is minimal from the perspective
of the coordination points. For the context of the experiments, the LOGISTICS
domain is tightly coordinated in that it requires relatively frequent
coordination among the involved agents: airplanes and trucks need to wait
for each other to load or unload the transported packages. The parallel
version of the LOGISTICS domain (par) for 5 and 6 agents involves two
parallel logistics sub-problems.
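
The coordination frequency of the example plan can be recomputed with a short Python sketch. It assumes, as the numbers in the text suggest, that the coordination frequency is the number of plan steps containing a public action divided by the plan length. Which actions count as public is also an assumption here: loading and unloading at an airport, where the package is handed over between agents.

```python
def coordination_frequency(plan, is_public):
    """Fraction of plan steps that contain at least one public action."""
    points = sum(
        1 for step in plan
        if any(a is not None and is_public(a) for a in step)
    )
    return points / len(plan)

def is_public(action):
    """Assumed notion of a public action: (un)loading at an airport."""
    return action.startswith(("l(", "u(")) and action.endswith(("a1)", "a2)"))

E = None  # the empty action
A = [E, E, E, "l(p,a1)", "f(a1,a2)", "u(p,a2)", E, E, E]
T1 = ["l(p,d1)", "m(d1,a1)", "u(p,a1)", E, E, E, E, E, E]
T2 = ["m(d2,a2)", E, E, E, E, E, "l(p,a2)", "m(a2,d2)", "u(p,d2)"]

plan = list(zip(A, T1, T2))  # one joint action per plan step
print(round(coordination_frequency(plan, is_public), 2))  # 0.44
```

Under these assumptions the four coordination points are the steps in which the package is unloaded or loaded at a1 and at a2.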

Problems of the ROVERS domain describe space exploration missions carried
out by autonomous rovers equipped for three types of tasks: soil analysis s,
rock analysis r and imaging i. The resulting data from the tasks have to be
communicated, c(s, r, i), back to the Earth in one data package over a
communication channel available only to one of the rovers at a time. The
data can be communicated only if they are prepared, p(s/r/i). The rock and
soil analysis can be executed provided that the rover is at a suitable
position and has an empty analytical store. The store can be emptied, if
required. The rovers can move among predefined waypoints with known
information about the samples. Images can be taken only from suitable
positions and with a camera calibrated and in a correct mode. An example
problem $\Pi^{rov3}$ used as one of the experiments has three fully equipped rovers
R1, R2 and R3. A solution

$$
\mathcal{P}^{rov3} = \left(
\begin{array}{llllllllll}
R1: & \cdots & p(r1) & \cdots & p(i1) & \cdots & p(s1) & \epsilon & c(s1,r1,i1) & \epsilon \\
R2: & \cdots & p(r2) & \cdots & p(i2) & \cdots & p(s2) & \epsilon & \epsilon & c(s2,r2,i2) \\
R3: & \cdots & p(r3) & \cdots & p(i3) & \cdots & p(s3) & c(s3,r3,i3) & \epsilon & \epsilon
\end{array}
\right)
$$

in which the leading segment of each row, before the communication steps,
consists of 10 private actions, has coordination frequency
$cf(\Pi^{rov3}) = \frac{3}{13} \approx 0.23$, following the same procedure as presented in the
previous paragraph for LOGISTICS. Therefore, for the context of the
experiments, the ROVERS domain is loosely coordinated in that it requires
coordination only at the end of plans.

The SATELLITES domain describes planning for a set of independent satellites
providing various types of deep space imagery i from orbit. Each imaging
instrument on board a satellite first has to be turned to point at one of
the predefined target directions. Secondly, each imaging instrument has to
be powered, switched on and calibrated before it can take an image, t(i), in
one of the predefined modes. A solution of one of the experimental
instances, $\Pi^{sat3}$, using three satellites S1, S2 and S3 is

$$
\mathcal{P}^{sat3} = \left(
\begin{array}{lll}
S1: & \cdots & t(i1) \\
S2: & \cdots & t(i2) \\
S3: & \cdots & t(i3)
\end{array}
\right)
$$

(3 private actions). In this case the coordination frequency is
$cf(\Pi^{sat3}) = 0$, as there is no public action in an optimal plan. Therefore
the domain is uncoordinated in that it does not need any coordination
between the satellites acquiring images individually.

Finally, in the COOPERATIVE PATHFINDING domain, a team of robots moves on a
3x3 grid (positions x1y1 to x3y3), where only a single robot can occupy a
cell at a time. The goal for the robots is to move, m(from, to), to the
positions occupied by the other robots in the initial state. A
representative problem $\Pi^{cp3}$ contains three robots R1, R2 and R3 and a
solution plan $\mathcal{P}^{cp3}$ is

$$
\mathcal{P}^{cp3} = \left(
\begin{array}{lll}
R1: & m(x1y2,x1y1) & m(x1y1,x2y1) \\
R2: & m(x2y1,x3y1) & m(x3y1,x3y2) \\
R3: & m(x3y2,x2y2) & m(x2y2,x1y2)
\end{array}
\right)
$$

Consequently, the coordination frequency is $cf(\Pi^{cp3}) = 1$, as each action in
an optimal plan is public, and therefore the domain represents fully
coordinated problems.

To evaluate the validity of Hypothesis 1, the multi-agent planning problems
were tested on the experimental setup against a plan repair algorithm
implementing replanning from scratch and two of the repair algorithms
introduced in the previous section: Back-on-Track-Repair (BoT repair,
Algorithm 2) and Repeated-Lazy-Repair (RLazy repair, Algorithm 4).

Efficiency problems of the original MA-Plan implementation (in [10]) limited
the experiments to plans with at most six landmarks (coordination points)
per agent. Additionally, the Back-on-Track-Repair algorithm could not
leverage the disjunctive goal form (cf. the construction of $\Pi_{back}$ in
Algorithm 2). This was emulated by an iterative process testing all term
conjunctions in a sequence, thus resulting in multiple runs of the DisCSP
solver instead of a single run with a disjunctive goal.

We used four metrics to evaluate the measurements:

execution length is the overall number of joint actions the experimental
setup executed,

planning time is the measured cumulative time consumed by the underlying
MA-Plan planner for generating the initial and the repairing plans,

repairing time is the overall time spent in MA-Plan invocations minus the
planning of the initial plan; and finally,

communication corresponds to the number of messages and the communication
volume in bytes passed between the agents during the planning or plan
repair process, that is, the messages generated by the DisCSP solver in
the MA-Plan planner.

5.3. Results and Discussion

The  first  batch  of  experiments  directly  targets  validation  of  Hypothesis  1: 

Multi-agent plan repair is expected to generate lower communication
overhead in tightly coordinated domains.

LOGISTICS and COOPERATIVE PATHFINDING, as tightly and fully coordinated
domains with the dynamics of the simulated environment modeled as action
failures, are suitable experiments to provide the required insight. Table 1
shows results for a fixed failure probability P = 0.3 and Figures 1, 2 and
3 depict the results of the experiment for 3-agent LOGISTICS with variable
probability P.

The highlighted results for 4-agent LOGISTICS in the table show that the
communication overhead generated by the Repeated-Lazy-Repair (RLazy)
algorithm is at 25% of that generated by the replanning approach. For
4-agent COOPERATIVE PATHFINDING, the communication overhead generated by
Back-on-Track-Repair (BoT) is at 18% of that generated by replanning.
Additionally, the communication overhead decreases with the increasing
number of agents in the problems. That means the plan repairing algorithms
scale better than replanning from scratch. The trends in Figures 1, 2 and 3
for the LOGISTICS domain show that the results are also valid for higher
values of P. Furthermore, the overhead decreases with increasing failure
probabilities. The communication overhead generated in the experiment by the
Back-on-Track-Repair algorithm is, averaged over all the measured
probabilities P, at 59% (36% at best) of that generated by the replanning
approach. The Repeated-Lazy-Repair algorithm performed even better and on
average produced only 43% (11% at best) of the communication overhead
generated by the replanning algorithm. Consequently, the experiments
strongly support our hypothesis.

Figure 1: Experimental results of the communication metrics for LOGISTICS domain with
3 agents and action failures.

The overall time spent in the planning phase (used by the MA-Plan algorithm)
by the plan repair algorithms echoes the results for the communication
overhead. Plan repairing scales better with higher numbers of agents in both
LOGISTICS and COOPERATIVE PATHFINDING. On average, over all the measured
probabilities P in 3-agent LOGISTICS, the planning time was at 54% (34% at
best) of that of replanning for Back-on-Track-Repair and at 51% (12% at
best) for Repeated-Lazy-Repair. Figure 2 depicts these results.

Figure 2: Experimental results of the planning time metrics for LOGISTICS domain with 3
agents and action failures.

Figure 3: Experimental results of the execution length metrics for LOGISTICS domain with
3 agents and action failures.

The second batch of experiments focused on the boundaries of validity of the
positive result presented above. In particular, we validated the condition
on the coordination tightness and feasibility of failures. The auxiliary
hypothesis we validated states:

With decreasing coordination frequency of the planning domain, the
communication efficiency gains of repairing techniques should decrease.
For loosely coordinated domains the communication efficiency of plan
repair should be on par with that of the replanning approach.

To validate the auxiliary hypothesis we ran experiments with ROVERS as a
loosely coordinated and SATELLITES as an uncoordinated planning problem.
The results in Table 1 show that the plan repairing algorithms are only
slightly better (by at most 10%) in terms of the generated communication
overhead than replanning, regardless of the number of agents. The trend
in Figure 4 shows similar results for various failure probabilities P. The
presented results support the auxiliary hypothesis.

Figure 5: Experimental results for LOGISTICS domain with 3 agents and state perturbations
with c = 1.


The third batch of experiments targeted the perturbation magnitude of
the plan failures. The second auxiliary hypothesis we validated states:

Communication efficiency gain of plan repairing in contrast to
replanning should decrease as the difference between the nominal
and the corresponding failed states increases.

The underlying intuition is that, in the case the dynamic environment
generates only relatively small state perturbations and the failed states are “not
far” from the actual state, the plan repair should perform relatively well. On
the other hand, if the perturbation essentially “teleports” the agents to completely
different states, replanning tends to generate more efficient solutions than
plan repair.

To test this hypothesis, we modified the LOGISTICS experiment to simulate
state perturbations as the model of the environment dynamics. Figure 5
depicts results of the experiment for c = 1. The perturbed state for c = 1
is produced by removing one term from the actual state and adding another
one. As the chart shows, under random perturbations the plan repairing
technique lost its advantage over replanning. For stronger perturbations
with c = 2, 3, 4 (not shown in the figure), the ratio between plan repairing
and replanning remained on average the same. The trend of the absolute
numbers of messages, planning time and execution length was slightly
decreasing, as the probability of opportunistic effects increased.

Besides supporting the presented hypotheses, the results also show the
differences between the two plan repairing algorithms. Table 1 highlights the
best results for communication volume and planning time. In most cases
the Repeated-Lazy-Repair algorithm is more efficient in communication than
the Back-on-Track-Repair algorithm. The exceptions are the COOPERATIVE
pathfinding and ROVERS domains with higher numbers of agents. These
problems share high combinatorial complexity (COOPERATIVE pathfinding
in coordination and ROVERS in local planning) and therefore techniques
that preserve more of the plan, such as Back-on-Track-Repair, benefit.

6.  Final  Remarks 

In the presented paper, we i) formally introduced the problem of multi-agent
plan repair, ii) formulated a notion of relative coordination frequency of
a planning problem based on Brafman and Domshlak’s number of coordination
points, iii) proposed three algorithms for solving the repair problem and
proved their correctness, and finally iv) formulated and experimentally
validated the hypothesis stating that, under certain conditions, multi-agent
plan repair approaches tend to be more efficient in terms of the communication
overhead they generate in comparison to replanning from scratch. Our
results support the core hypothesis of the paper well, and we additionally
performed a series of experiments validating its boundary conditions articulated
by the series of auxiliary hypotheses in Section 5.

The line of research underlying this paper correlates well with recent
works on classical single-agent planning sub-domains, such as partially ordered
plan monitoring and repairing [17], conformant [18] and contingency planning
[19], plan re-use [3] and plan adaptation [20]. Environment dynamics is
also handled by approaches based on Markov decision processes. The main
difference to our approach is that the state perturbations utilized in our
experiments have a priori unknown probabilities. Our own recent approach to
the problem of multi-agent plan repair in [21] can be seen only as a precursor
to the formal and rigorous treatment of the problem in this paper. Therein,
we described the first steps towards a formal treatment of the problem and
proposed two specific incomplete algorithms for solving the problem,
very distinct from the ones presented here. A sketch of the ideas behind
the three plan repairing algorithms was additionally published in our recent
work [14], however without a fully detailed formal description and an exhaustive
evaluation. Those are presented herein.

There are several open challenges resulting from the presented work.
Firstly, the multi-agent planning framework (MA-Strips) is not expressive
enough to describe certain aspects of concurrent actions and should be
extended to this end. This, we suspect, will also influence the multi-agent
planning complexity analysis. In particular, there is no way to account for
joint actions which have effects strictly different from the union of the
individual actions involved. Secondly, there is a need for more efficient and
feature-rich implementations of multi-agent planners, as the gap between the
state-of-the-art classical planners and multi-agent planners is still enormous.
Addressing this challenge should also answer the question of the scalability
potential of the plan repairing approaches based on such planners. Thirdly, there
is a lack of standardized planning benchmarks for multi-agent planning,
especially considering tightly coordinated planning problems. Exploration of
such benchmarks is needed to further evaluate the hypotheses presented in this paper.
Finally, we leave the work towards resolving the validity of Hypothesis 2,
aiming at an analytical investigation of the complexity properties of multi-agent
plan repair, to future work.

Abstract

Deterministic domain-independent multiagent planning is an approach to
coordination of cooperative agents with joint goals. Provided that the agents act in an
uncertain and dynamic environment, such plans can fail. The straightforward
approach to recover from such situations is to compute a new plan from scratch, that
is to replan. Even though, in the worst case, plan repair or plan re-use does not yield
an advantage over replanning from scratch, there is sound evidence from
practical use that approaches trying to repair the failed original plan can outperform
replanning in selected problems. One of the possible plan repairing techniques is
based on preservation of fragments of the older plans.

This work theoretically analyses the complexity of plan repairing approaches based
on preservation of fragments of the original plan and experimentally studies three
practical aspects affecting its efficiency in various multiagent settings. We focus
on both the computational and the communication efficiency of plan repair
in comparison to replanning from scratch, and we report on the influence of the
following properties on the efficiency of plan repair: (1) the number of involved
agents in the plan repairing process, (2) inter-dependencies among the repaired
actions, and finally (3) particular modes of re-use of the older plans.

Keywords: multiagent systems, automated planning, plan repair

1 Introduction


Consider a team of heterogeneous robots working together to execute a mission
in an environment. Since the robots feature heterogeneous capabilities, it might well
be that none of them is able to complete the mission on its own; however, by careful
coordination and teamwork, they should be able to reach the joint objective. The team
of physical robots is embodied in a dynamic environment in which various events and
plan execution interruptions occur and, most importantly, in which actions of the agents
can fail. To execute their mission, the robots represented by deliberative agents must be
able to cope with such dynamics on both the individual and the coordination
level. Here we focus on the problem of multiagent plan repair, which tackles this issue.

Recently, an approach to multiagent (MA) plan repair (MA-REPAIR) was proposed
in [8], based on multiagent planning (MA-STRIPS) as introduced in [2] and the
classical MODDELINS principle from [12]. MA-STRIPS is an approach to planning for
teamwork and coordination extending the classical STRIPS-based planning techniques.
According to the MA-REPAIR approach, the multiagent team computes a team plan
using a fully decentralized MA-STRIPS planning algorithm, and subsequently executes
the plan while monitoring for possible failures of plan execution. Upon
an occurrence of such a failure, the team stops execution and invokes a plan repair
algorithm to fix the failed joint plan in order to reach a joint goal state from the state
in which the failure occurred.

It can be argued that plan re-use based on MODDELINS does not yield much
advantage with respect to the computational complexity in the worst case [12], since
attempts to fix a failed plan sometimes lead to replanning from scratch anyway. In
multiagent and multi-robot settings, where communication is unreliable and costly,
however, communication is often of higher priority than the computational
complexity.

In [10], the authors proposed prefix and suffix-based approaches to multiagent
plan repair. These repairing approaches save communication in contrast to replanning
from scratch in tightly coupled problems with action failures; however, the research
question of which plan repairing techniques are more appropriate for which planning
domains and problems remained unanswered. In this work, we extend our recent
experimental study [9], where we generalized the prefix and suffix-based approaches, and
present a coherent analysis of the computational complexity and practical properties of
how particular multiagent plan repair techniques and particular parameterizations
perform in different planning domains.

2  Multiagent  Planning  &  Repair 

We  use  MA-Strips  [2]  as  a  model  for  the  multiagent  planning  problems. 

Definition 1. Let a quadruple $\Pi = (P, A, I, G)$ be a multiagent planning problem over

• a finite set of propositions $P$ denoting facts about the environment the agents
operate in,

• a set of $n$ agents $A = \{A_1, \dots, A_n\}$, each characterized by a finite set of
actions with STRIPS syntax and semantics the agent can perform, formally
$A_i = \{ (pre(a), add(a), del(a)) \mid pre(a), add(a), del(a) \subseteq P \}$, where
$pre(a)$, $add(a)$, and $del(a)$ represent sets of preconditions, add effects, and delete
effects respectively; the transition function describing the change of the environment in
state $S$ after execution of action $a$ into a new state $S'$ is defined as
$S' = (S \cup add(a)) \setminus del(a)$ provided that $pre(a) \subseteq S$, i.e., the
action $a$ is applicable in the state $S$,

• an initial state $I \subseteq P$ the environment begins in, and

• goal state conditions $G \subseteq P$ characterizing the agents’ joint objective(s) s.t. a
state $S \subseteq P$ is a goal state iff $G \subseteq S$.

Figure 1: A multiagent plan in matrix form with a visual representation of the
intermediate states $S_1, \dots, S_{m-1}$ and joint actions
$\mathbf{a}_i = (a_{i1}, \dots, a_{in})$. The gray nodes
represent planned states of evolution of the environment, I is the initial state and G is
the set of goal conditions defining a set of possible goal states. The arcs represent the
planned joint actions.

Additionally, we define the set of propositions of an action $a$ as
$prop(a) = pre(a) \cup add(a) \cup del(a)$ and the set of propositions an agent $A_i$
affects or is affected by as $P_i = \bigcup_{a \in A_i} prop(a)$. Each action set also
contains an empty action $\epsilon = (\emptyset, \emptyset, \emptyset)$. According
to MA-Strips, a distinguished subset of an agent’s public actions known to all other
agents is defined as
$A_i^{pub} = \{ a \mid a \in A_i \text{ s.t. } \exists j \neq i : prop(a) \cap P_j \neq \emptyset \}$
and the complement, denoted as private actions, is defined as
$A_i^{priv} = A_i \setminus A_i^{pub}$. A joint action of all agents is defined as an
n-tuple $\mathbf{a} = (a_1, \dots, a_n)$ where for each $a_i$ holds $a_i \in A_i$.
Execution of a joint action $\mathbf{a}$ is defined as
$S' = (S \cup \bigcup_{a \in \mathbf{a}} add(a)) \setminus \bigcup_{a \in \mathbf{a}} del(a)$
provided that $\bigcup_{a \in \mathbf{a}} pre(a) \subseteq S$ and
$\bigcup_{a \in \mathbf{a}} add(a) \cap \bigcup_{a \in \mathbf{a}} del(a) = \emptyset$.
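The public/private split above can be illustrated with a minimal Python sketch. The `Action` record, the helper names, and the truck/crane example are illustrative assumptions and do not come from the paper; only the set expressions follow the definitions of $prop(a)$, $P_i$ and $A_i^{pub}$.

```python
from typing import FrozenSet, List, NamedTuple, Sequence


class Action(NamedTuple):
    """A STRIPS action; the field names are illustrative assumptions."""
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    dele: FrozenSet[str]  # "del" is a reserved word in Python


def prop(action: Action) -> FrozenSet[str]:
    # prop(a) = pre(a) U add(a) U del(a)
    return action.pre | action.add | action.dele


def agent_props(actions: Sequence[Action]) -> FrozenSet[str]:
    # P_i: union of prop(a) over all actions of one agent
    return frozenset().union(*(prop(a) for a in actions))


def public_actions(agents: Sequence[Sequence[Action]], i: int) -> List[Action]:
    # An action of agent i is public if it shares a proposition with some other agent.
    others = [agent_props(acts) for j, acts in enumerate(agents) if j != i]
    return [a for a in agents[i] if any(prop(a) & p for p in others)]


def private_actions(agents: Sequence[Sequence[Action]], i: int) -> List[Action]:
    public = public_actions(agents, i)
    return [a for a in agents[i] if a not in public]


# Hypothetical two-agent example: a truck (agent 0) and a crane (agent 1).
drive = Action("drive", frozenset({"truck_at_a"}), frozenset({"truck_at_b"}),
               frozenset({"truck_at_a"}))
unload = Action("unload", frozenset({"truck_at_b", "pkg_in_truck"}),
                frozenset({"pkg_at_b"}), frozenset({"pkg_in_truck"}))
pickup = Action("pickup", frozenset({"pkg_at_b"}), frozenset({"pkg_held"}),
                frozenset({"pkg_at_b"}))
agents = [[drive, unload], [pickup]]
```

In this example only `unload` and `pickup` touch the shared proposition `pkg_at_b`, so `drive` stays private to the truck.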

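Definition 2. A sequence of joint actions $\pi = (\mathbf{a}_1, \dots, \mathbf{a}_m)$ is a
multiagent plan solving a multiagent planning problem $\Pi = (P, A, S_0, G)$ iff
$\bigcup_{a \in \mathbf{a}_i} pre(a) \subseteq S_{i-1}$ and
$\bigcup_{a \in \mathbf{a}_i} add(a) \cap \bigcup_{a \in \mathbf{a}_i} del(a) = \emptyset$,
where for the intermediate states hold
$S_i = (S_{i-1} \cup \bigcup_{a \in \mathbf{a}_i} add(a)) \setminus \bigcup_{a \in \mathbf{a}_i} del(a)$
for $i \in 1, \dots, m$ and $G \subseteq S_m$.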
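The joint-action semantics and Definition 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: states are frozensets of proposition names, and the names `Action`, `execute_joint` and `solves` are hypothetical helpers introduced here for illustration.

```python
from typing import FrozenSet, NamedTuple, Optional, Sequence, Tuple

State = FrozenSet[str]


class Action(NamedTuple):
    """A STRIPS action; the defaults make Action() the empty action."""
    pre: FrozenSet[str] = frozenset()
    add: FrozenSet[str] = frozenset()
    dele: FrozenSet[str] = frozenset()  # "del" is a reserved word in Python


Joint = Tuple[Action, ...]  # one action per agent


def execute_joint(state: State, joint: Joint) -> Optional[State]:
    """Return the successor state, or None if the joint action is not executable."""
    pre = frozenset().union(*(a.pre for a in joint))
    adds = frozenset().union(*(a.add for a in joint))
    dels = frozenset().union(*(a.dele for a in joint))
    # All preconditions must hold and add/delete effects must not clash.
    if not pre <= state or adds & dels:
        return None
    return (state | adds) - dels


def solves(plan: Sequence[Joint], init: State, goal: State) -> bool:
    """Check Definition 2: every step is executable and the goal holds at the end."""
    state: Optional[State] = init
    for joint in plan:
        state = execute_joint(state, joint)
        if state is None:
            return False
    return goal <= state
```

A plan is accepted only if each joint action is executable in the state produced by its predecessors, which mirrors the condition on the intermediate states $S_i$.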

We will denote the length of a sequence of joint actions
$\pi = (\mathbf{a}_1, \dots, \mathbf{a}_m)$ as $|\pi| = m$. A concatenation of two
multiagent plans will be denoted as
$(\mathbf{a}^1_1, \dots, \mathbf{a}^1_k) \cdot (\mathbf{a}^2_1, \dots, \mathbf{a}^2_l) = (\mathbf{a}^1_1, \dots, \mathbf{a}^1_k, \mathbf{a}^2_1, \dots, \mathbf{a}^2_l)$.
To index a $k$-th joint action in $\pi$, we will write $\pi[k]$, and $\pi[k, \dots, l]$
where $1 \le k \le l \le m$ will denote a fragment of the sequence
$(\mathbf{a}_k, \dots, \mathbf{a}_l)$. A public projection of plan $\pi$ will be denoted
as $\pi^{pub}$ and replaces all actions which are not in any $A_i^{pub}$ by the empty
action $\epsilon$. A visual representation of a multiagent plan is depicted in Figure 1.

With  formally  defined  multiagent  planning  problems  and  multiagent  plans  solving 
them,  we  can  formally  present  a  problem  of  multiagent  plan  repair  MA-Repair  as 
defined  in  [10]: 

Definition 3. Let a quadruple $\Sigma = (\Pi, \pi, F, k)$ be a multiagent plan repair problem
over

• a multiagent planning problem $\Pi = (P, A, I, G)$,

• an original multiagent plan $\pi$ solving the problem $\Pi$ whose execution failed s.t.
after execution of some $\mathbf{a}_k$, the resulting state differs from the intermediate
state $S_k$,

• a state $F \subseteq P$ which the system happens to be in, unexpectedly, after the plan
execution failure, and

• the step $k \in 1, \dots, |\pi|$, after which the failure occurred.

A solution to a multiagent plan repair problem $\Sigma = (\Pi, \pi, F, k)$ is a multiagent plan
$\pi'$ solving a modified planning problem $\Pi' = (P, A, F, G)$.

Two  auxiliary  definitions  are  needed  for  formal  description  of  the  proposed  plan 
repair  algorithm. 

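Definition 4. Let the forward proposition propagation operator $\oplus$ be a mapping
$\oplus : 2^P \times (\mathcal{A}_1 \times \mathcal{A}_2 \times \dots \times \mathcal{A}_m) \to 2^P$,
where each $\mathcal{A}_j = \times_{i=1}^{n} A_i$ and $m \ge 1$, defined as

$$
S \oplus \pi \stackrel{\mathrm{def}}{=}
\begin{cases}
\left( S \cup \bigcup_{a \in \pi[1] \text{ s.t. } pre(a) \subseteq S} add(a) \right) \setminus \bigcup_{a \in \pi[1] \text{ s.t. } pre(a) \subseteq S} del(a) & \text{for } |\pi| = 1, \\
(S \oplus (\pi[1])) \oplus \pi[2, \dots, |\pi|] & \text{otherwise.}
\end{cases}
\qquad (1)
$$

Similarly to the transformation operator $\oplus$, a reverse-transformation operator is
defined as follows:

Definition 5. Let the proposition back-propagation operator $\ominus$ be a mapping
$\ominus : 2^P \times (\mathcal{A}_1 \times \mathcal{A}_2 \times \dots \times \mathcal{A}_m) \to 2^P$,
where each $\mathcal{A}_j = \times_{i=1}^{n} A_i$ and $m \ge 1$, defined as

$$
S \ominus \pi \stackrel{\mathrm{def}}{=}
\begin{cases}
\left( S \cup \bigcup_{a \in \pi[1]} del(a) \right) \setminus \bigcup_{a \in \pi[1]} add(a) & \text{for } |\pi| = 1, \\
(S \ominus (\pi[|\pi|])) \ominus \pi[1, \dots, |\pi| - 1] & \text{otherwise.}
\end{cases}
\qquad (2)
$$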
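The two propagation operators can be sketched iteratively in Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names `forward` and `backward` are hypothetical, states are frozensets of proposition names, and the recursion of Definitions 4 and 5 is unrolled into loops (front to back for the forward operator, back to front for the back-propagation).

```python
from typing import FrozenSet, NamedTuple, Sequence, Tuple

State = FrozenSet[str]


class Action(NamedTuple):
    pre: FrozenSet[str] = frozenset()
    add: FrozenSet[str] = frozenset()
    dele: FrozenSet[str] = frozenset()  # "del" is a reserved word in Python


Joint = Tuple[Action, ...]


def forward(state: State, plan: Sequence[Joint]) -> State:
    """Forward propagation: apply only the actions whose preconditions hold."""
    for joint in plan:
        usable = [a for a in joint if a.pre <= state]
        adds = frozenset().union(*(a.add for a in usable))
        dels = frozenset().union(*(a.dele for a in usable))
        state = (state | adds) - dels
    return state


def backward(state: State, plan: Sequence[Joint]) -> State:
    """Back-propagation: undo effects from the last joint action to the first."""
    for joint in reversed(plan):
        adds = frozenset().union(*(a.add for a in joint))
        dels = frozenset().union(*(a.dele for a in joint))
        # Mirrors the definition as written: re-insert deleted propositions, drop added ones.
        state = (state | dels) - adds
    return state
```

Note that the forward operator silently skips inapplicable actions, while the back-propagation takes every action of the fragment into account.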


For further use, we define a metric on actions in a multiagent plan describing the
importance of an action with respect to the number of actions which would no
longer be applicable if the action were not present in the plan, formally:

Definition 6. Let $A^{dep}_k \subseteq \mathbf{a}_k$ in a multiagent plan $\pi$ at step $k$.
Let a set of actions $A^{dep}_{k+1}$ contain all actions of $\mathbf{a}_{k+1}$ which are
dependent on actions of $A^{dep}_k$ based on the effect-precondition relation, formally
$A^{dep}_{k+1} = \{ a \mid a \in \mathbf{a}_{k+1} \text{ s.t. } \bigcup_{a' \in A^{dep}_k} eff(a') \cap pre(a) \neq \emptyset \}$.

The dependency metric dep of an action $a \in \mathbf{a}_k \in \pi$ is then defined as


(3)

Figure 2: A visual representation of back-on-track (left) and lazy repair (right). The
nodes and arcs have the same meaning as in Figure 1. The dashed arcs represent planned
but not yet executed actions, the solid ones were already executed. Only the joint
action $\mathbf{a}_1$ was executed without a failure. The execution of $\mathbf{a}_2$
failed and the environment ended in the state F. The orange nodes represent planned
states of the repair plan $(\mathbf{a}^r_1, \dots, \mathbf{a}^r_k)$. The pale orange states
with their respective actions represent various possibilities of back-on-track repair
plans. The cyan nodes represent a solution by replanning. In lazy repair, the application
of $\mathbf{a}_2$ and further actions ignores preconditions, which might no longer be
satisfied in F.


Besides straightforward multiagent replanning from scratch, which invokes a
multiagent planner at the point of a failure and then executes the computed plan right away,
two main approaches to multiagent plan repair were presented in [10]: back-on-track
(BoT) and lazy repair (LR).

Both algorithms first formulate a modified multiagent planning problem and rely on
the underlying multiagent planner to compute a plan fragment used for re-composition
into a solution plan repairing the original failed one. We will refer to the underlying
planner as an inner planner, as it will be used as a component of the plan repair
algorithm. The back-on-track strategy (see Figure 2, left) tries to fix the prefix of the failed
plan by computing a plan from the state in which the system happens to be right after
the detection of a plan execution failure to some state along an ideal failure-free
execution of the original plan. The resulting multiagent plan re-uses some suffix of the
original plan, if possible, and extends the plan at its beginning. The idea underlying
the lazy repair is complementary (see Figure 2, right). Lazy repair takes the remainder
of the original plan, re-uses all its actions which still can be executed according to their
preconditions regardless of the outcome, and completes the plan to some goal state of
the planning problem. This way, the resulting plan is composed of re-used prefix parts
of the original plan with an appended suffix of some new repaired plan. Experiments
in [10] showed that these approaches lead to significant savings of communication,
as well as computational resources, in comparison to replanning from scratch on the used
planning domains and problems.

Based on the experimental analysis of the two plan repair algorithms and the formal
definitions, we can state the core hypotheses of our work here.

Following the theoretical results of Brafman and Domshlak’s complexity analysis
of MA-STRIPS planning in [2], our motivation is not to substantially increase the
computational complexity of the repair algorithms with respect to the inner planner. Although
the communication complexity of MA-STRIPS planning has not yet been studied
theoretically in the literature, we cover it in the hypothesis as well, as we hypothesize that
the communication complexity is a function of the computational one.

Hypothesis 1. Suffix, prefix or combined multiagent plan repair does not introduce
any additional exponential dependency on the number of involved agents with respect
to the inner planner, both in computational and communication complexity.

Despite the theoretical results showing that MA-STRIPS planning is not exponentially
dependent on the number of involved agents, we hypothesize that in practice the
overhead of planning or plan repair grows with a higher number of agents, especially if the
inner planner assumes only public goal condition propositions, which is a common
assumption in the literature (e.g., in [13]).

Hypothesis  2.  Repairing  algorithms  minimizing  the  number  of  agents  involved  in  the 
plan  repairing  process  tend  to  generate  lower  computational  and  communication 
overheads  than  other  strategies. 

Immediate repair of actions on which a substantial number of other actions depend
(see Definition 6) is intuitively more efficient than repairing such actions later, with
possibly smaller reusable parts of the original plan and a higher number of actions to
repair in general. The next hypothesis describes this intuition.

Hypothesis 3. Repairing algorithms reusing the original plan as a suffix generate
lower computational and communication overheads than the repairing algorithms
reusing the original plan as a prefix in domains with actions with high
values of the dependency metric.

If we parameterize the lengths of the reused prefix u and suffix v of the plan repairing
process, an interesting question is how different combinations of these parameters
influence the efficiency of the plan repairing process. Let m be the length of the
reusable part of the original plan $\pi$, then $m = |\pi| - k$, where k is the step after the failed
action. Obviously, for u + v < m, there will be a gap, which has to be filled by a result
of the inner planner; in other words, the original plan was underused. Conversely, for
u + v > m, there will be an overlap, which has to be reverted, i.e., the original plan
was overused. Intuitively, these cases are in a sense pathological. The last hypothesis
states that repair strategies neither underusing nor overusing the original plan should be the
most efficient:

Hypothesis 4. Repairing algorithms overusing or underusing the original plan tend
to generate higher computational overheads than other algorithms.

Figure 3: A visual representation of generalized repair. The nodes and arcs have the
same meaning as in Figure 2. The repair plan $(\mathbf{a}^r_1, \dots, \mathbf{a}^r_k)$ is
used to connect the prefix and suffix of the original plan. The parameters u and v
prescribe how many actions are reused as the prefix and the suffix respectively.
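The gap and overlap cases discussed before Hypothesis 4 reduce to simple arithmetic on u, v and the reusable length $|\pi| - k$. The following Python sketch makes that explicit; the function name `classify_reuse` and its string labels are illustrative assumptions.

```python
def classify_reuse(plan_length: int, k: int, u: int, v: int) -> str:
    """Classify a (u, v) pair against the reusable length m = |pi| - k."""
    m = plan_length - k
    if not (0 <= u <= m and 0 <= v <= m):
        raise ValueError("u and v must lie between 0 and |pi| - k")
    if u + v < m:
        return "underuse"  # a gap remains that the inner planner has to fill
    if u + v > m:
        return "overuse"   # prefix and suffix overlap and part must be reverted
    return "exact"
```

For a plan of length 10 that failed after step 4, the reusable part has length 6, so reusing a prefix of 2 and a suffix of 3 leaves a gap of one step.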


3  Generalized  Repair 

As outlined in the previous section, the algorithm used in the further analyses is a
combination of the lazy and back-on-track approaches presented in [10], which are
orthogonal to each other in how they reuse the original plans. These two approaches
can be combined into one algorithm using the original plan both as a prefix and as a
suffix. Such an approach generalizes the first two approaches and reuses
the original plan in both fashions, as shown in Figure 3.

Definition 7. (generalized repair) Let $\Sigma = (\Pi, \pi, F, k)$ be a multiagent plan repair
problem and let $\Pi' = (P, A, F, G)$ be the corresponding modified multiagent
replanning problem.

A multiagent plan $\pi'$ solving $\Pi'$ is a generalized repair of $\pi$ for vectors of indexes
$U$ and $V$ iff there is a decomposition of $\pi'$, such that
$\pi' = \pi_F[k, \dots, k+u] \cdot \pi_{fix} \cdot \pi[|\pi| - v, \dots, |\pi|]$,
where $\pi_F[k, \dots, k+u]$ is a fragment of plan $\pi$ omitting inapplicable
actions beginning with the state $F$ for some $u \in U$, $\pi_{fix}$ is a new plan connecting the
reused fragments, and $\pi[|\pi| - v, \dots, |\pi|]$ is a fragment of the original plan for some
$v \in V$. The vectors contain only valid indexes to the reused part of the multiagent plan
$\pi$, that is $\forall u \in U : 0 \le u \le |\pi| - k$ and $\forall v \in V : 0 \le v \le |\pi| - k$,
to meet the requirements of the decomposition parts $\pi_F[k, \dots, k+u]$ and
$\pi[|\pi| - v, \dots, |\pi|]$. It holds that $|U| = |V|$ and for $\forall i$ s.t.
$1 \le i \le |\pi|$ there is no $j : 1 \le j \le |\pi|, j \neq i$ s.t.
$(u_i, v_i) = (u_j, v_j)$. Also $0 \in U$, $0 \in V$.

To illustrate the generalized notion of this repair, it can be shown that for
$U = (|\pi| - k)$, $V = (0)$ the approach becomes the lazy repair and for $U = (0)$,
$V = (|\pi| - k, \dots, 0)$ the back-on-track approach. In the first case, the original plan
is reused as a plan fragment of length $|\pi| - k$ ignoring preconditions, matching the
definition of lazy repair in [10]. In the other case, the definition of the index vector V
implies trying to reuse a plan fragment starting with length $|\pi| - k$ and ending with
length 0, matching the back-on-track repair definition in [10]. Finally, $U = (0)$,
$V = (0)$ describes replanning.

It could be argued that the generalization reusing the original plan only as prefix and
suffix parts is in fact not the only possible repair scheme, e.g., in terms of the
MODDELINS scheme presented in [12]. The MODDELINS reuse scheme describing the
presented generalized repair approach is

$(\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_u) \cdot \pi_{fix} \cdot (\mathbf{a}_{|\pi|-v}, \mathbf{a}_{|\pi|-(v-1)}, \dots, \mathbf{a}_{|\pi|})$,    (4)

however a scheme

$(\mathbf{a}_1, \dots, \mathbf{a}_u), \pi_{fix}, (\dots, \mathbf{a}_{u+t}), \pi'_{fix}, (\mathbf{a}_{|\pi|-w}, \dots, \mathbf{a}_{|\pi|})$    (5)

would also be possible, but generalized repair cannot directly represent it. The
motivation of the generalized repair, however, is not to be general in the sense of the reuse
pattern of the original plan, but to be general from the perspective of reuse of the original
plan as both prefix and suffix plan fragments. For such a case, the generalized repair scheme is
complete, since breaking the $\pi_F[k, \dots, k+u]$ or the $\pi[|\pi| - v, \dots, |\pi|]$
fragments into smaller parts would require additional $\pi_{fix}$ plans, which could be
concatenated into one central fixing fragment as defined in Definition 7.

The algorithm for the generalized repair is outlined in Algorithm 1. When
a failure is detected by the agent team, the current state after the failure is retrieved
and the plan repair algorithm for the plan repair problem $\Sigma = (\Pi, \pi, F, k)$ is invoked.
In each plan repair attempt, a modified multiagent planning problem is formulated
according to the current values of u and v prescribing the length of the reused prefix
and suffix of the original plan. These parameters are taken from two vectors U and V
parameterizing the repair.

If a repair plan is found, the repair process finishes; otherwise another attempt with
a different combination of u and v is made (selection of u has priority over v in the
combination of indexes). The resulting repairing plan consists of three components:
the preserved prefix of the original plan $\pi_{pre}$, a newly computed infix $\pi_{fix}$,
and a suffix part $\pi_{suf}$, again preserving a part of the original plan $\pi$.

The preserved prefix part $\pi_F$ of the original plan corresponds to a plan fragment of
$\pi$ ignoring the preconditions such that only actions applicable in sequence beginning
from state F are used. The actions with unmet preconditions are simply omitted.
Additionally, the prefix $\pi_{pre}$ is based only on a part of the original plan, effectively reusing
u actions beginning after the k-th action of the original plan $\pi$. The suffix part $\pi_{suf}$ is
obtained as the last v actions of the original plan $\pi$.

Finally, the infix part of the plan is computed by invocation of the inner multiagent
planner MA-Plan¹. The initial state of the modified planning problem is the state
which a failure-free execution of the repair prefix $\pi_{pre}$ would result in when starting
from the state F, that is the propagation $F \oplus \pi_{pre}$. The set of goal states
$G \ominus \pi_{suf}$ corresponds to a back-propagation of effects of the preserved suffix
component $\pi_{suf}$ from the set of original goals G.

If  the  multiagent  planner  finds  a  plan  for  the  modified  planning  problem,  the  repair 
plan  takes  the  form  7rpre  •  7Tfix  •  7rsuf  and  gets  executed  from  the  failure  point  on.  In 

1  In  the  experiments,  we  used  the  Planning  First  implementation  of  a  MA-STRIPS  planner  from  [13], 


Algorithm  1  Generalized-Repair(£,  U,  l^MA-Plan) 

Input:  A  multiagent  plan  repair  problemE  =  (II,  7r,  F,  k). 

Input:  Parameters  U  and  V  prescribing  the  lengths  for  reusing  of  the  original  plan  as 
prefix  and  suffix  respectively. 

1:  u ,  v  =initial  pair  of  u  £  U  and  v  £  V 

2:  repeat 

3:  npie  =  TTF[k,...,k  +  u] 

4:  7Tsuf  =  7r[|7r|  -  V,  .  .  .  ,  |7r|] 

5:  7Tfix  =  MA-Plan((P,  A,  F  ©  7Tpre,  G  ©  7Tsuf)) 

6:  if  7Tfix  /  0  then 

7:  7T  =  7Tpre  '  TTfix  *  7Tsuf 

8:  break 

9:  end  if 

10:  until  tested  all  pairs  ofu  £  U  andu  £  V 

it:  if 7r  —  0  then  return  fail 


the  case  no  repair  plan  can  be  found,  the  algorithm  attempts  the  repair  for  a  different 
combination  of  u  and  v  until  either  a  repair  plan  is  found,  or  it  turns  out  that  no  repair 
for  the  failure  exists. 
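
The generate-and-test loop of Algorithm 1 can be sketched in Python as follows. This is a minimal single-process illustration under stated assumptions, not the implementation used in the experiments: actions are plain STRIPS triples, `propagate` reads the propagation of F through the prefix as "apply the applicable actions in order and skip the rest", `back_propagate` reads the back-propagation of G through the suffix as standard STRIPS regression, `bfs_plan` is a hypothetical centralized stand-in for MA-Plan, and `k` is a 0-based index of the first action that was not executed. All names are illustrative.

```python
from collections import deque
from itertools import product
from typing import FrozenSet, List, NamedTuple, Optional, Sequence, Tuple

State = FrozenSet[str]

class Action(NamedTuple):
    name: str
    pre: State    # preconditions
    add: State    # add effects
    dele: State   # delete effects

def propagate(state: State, plan: Sequence[Action]) -> Tuple[State, List[Action]]:
    """Assumed propagation: apply applicable actions in order, skip the others."""
    applied = []
    for a in plan:
        if a.pre <= state:
            state = (state - a.dele) | a.add
            applied.append(a)
    return state, applied

def back_propagate(goal: State, plan: Sequence[Action]) -> Optional[State]:
    """Assumed back-propagation: STRIPS regression; None if undefined."""
    for a in reversed(plan):
        if a.dele & goal:          # the action would destroy a needed fact
            return None
        goal = (goal - a.add) | a.pre
    return goal

def bfs_plan(actions, init: State, goal: State) -> Optional[List[Action]]:
    """Hypothetical stand-in for MA-Plan: breadth-first forward search."""
    frontier, seen = deque([(init, [])]), {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for a in actions:
            if a.pre <= state:
                nxt = (state - a.dele) | a.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [a]))
    return None

def generalized_repair(actions, plan, failed_state, goal, k, U, V, planner=bfs_plan):
    """Try all (u, v) pairs; return prefix + infix + suffix, or None on failure."""
    for u, v in product(U, V):
        start, pre = propagate(failed_state, plan[k:k + u])
        suf = list(plan[len(plan) - v:]) if v else []
        target = back_propagate(goal, suf)
        if target is None:
            continue
        fix = planner(actions, start, target)
        if fix is not None:
            return pre + fix + suf
    return None

def move(src: str, dst: str) -> Action:
    return Action(f"move_{src}{dst}", frozenset({f"at_{src}"}),
                  frozenset({f"at_{dst}"}), frozenset({f"at_{src}"}))

actions = [move("a", "b"), move("b", "c"), move("c", "d")]
original = list(actions)
goal = frozenset({"at_d"})
F = frozenset({"at_a"})            # the first action was not executed
repaired = generalized_repair(actions, original, F, goal, k=0, U=[0], V=[2, 0])
print([a.name for a in repaired])  # ['move_ab', 'move_bc', 'move_cd']
```

In the toy run the last two actions are preserved as the suffix and only the single failed move is re-planned; with `U=[0], V=[0]` the same call degenerates to replanning from scratch.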

The description of the algorithm is concluded with proofs of soundness and
completeness. Since the generalized repair algorithm invokes the inner multiagent
planner similarly to the back-on-track and lazy repairs [10], its correctness
relies on the correctness of the inner planner.

Lemma 8. (soundness). Let $\Pi = (P, A, I, G)$ be a multiagent planning problem with
agents situated in a dynamic environment in which the environment can interfere with
the plan execution, and let $\pi$ be a solution to $\Pi$. Let also $F$ be a state resulting from an
interference of the environment, a plan failure, at a step $k$ of the execution of the plan $\pi$.
$\Sigma = (\Pi, \pi, F, k)$ denotes the corresponding multiagent plan repair problem.

Provided that the execution of Generalized-Repair($\Sigma$, $U$, $V$, MA-Plan) does not
fail, but finishes with a resulting plan $\pi'$, and MA-Plan is a sound MA-STRIPS planner,
a failure-free execution of $\pi'$ leads to some goal state of the original multiagent
planning problem $\Pi$.

Proof. Regardless of what particular state $F \oplus \pi_{pre}$ the failure-free execution of the
applicable actions from the fragment $\pi_F$ ends up in, the solution plan $\pi_{fix}$, if it exists, to
the problem $(P, A, F \oplus \pi_{pre}, G \ominus \pi_{suf})$ will take the system from the state $F \oplus \pi_{pre}$
to a state in $G \ominus \pi_{suf}$ corresponding to the original multiagent planning problem $\Pi$. The
back-propagated propositions $G \ominus \pi_{suf}$ either represent a required (part of a) state along
the original execution trace of the original plan $\pi$, and then the remainder $\pi_{suf}$ leading
to an original final state is reused, or a failure-free execution of $\pi_{fix}$ leads directly to
one of the final states defined by $G$ without reusing a part of $\pi$ as suffix, i.e., $\pi_{suf} = \emptyset$, and
therefore $G \ominus \pi_{suf} = G$. □

The initial part and the final part of the proof are based on the soundness proofs of
lazy repair and back-on-track, respectively, in [10]. As mentioned before, in generalized
repair these two approaches merge; therefore the proofs are based on the same
argumentation.

Lemma 9. (completeness). Let $\Pi = (P, A, I, G)$ be a multiagent planning problem
and let $\pi$, $F$, $k$, as well as $\Sigma$ be as in Lemma 8. Let $U$ and $V$ be integer vectors by
Definition 7.

If there exists a solution plan to the multiagent planning problem $\Pi' = (P, A, F, G)$
and MA-Plan is a complete MA-STRIPS planner, then the execution of
Generalized-Repair($\Sigma$, $U$, $V$, MA-Plan) finishes and finds a repair plan of $\pi$.

Proof. The algorithm tests all combinations of $U$ and $V$ values. It eventually tests the
required combination $u = 0$ and $v = 0$. Based on the definition of the algorithm, in
such a case, $\pi = \pi_{pre} \cdot \pi_{fix} \cdot \pi_{suf}$ degenerates to $\pi = \pi_{fix}$ since $\pi_F[k, \ldots, k+u] =
\pi_F[k, \ldots, k] = \emptyset$ and $\pi[|\pi| - v, \ldots, |\pi|] = \pi[|\pi|, \ldots, |\pi|] = \emptyset$. The plan $\pi_{fix}$ is then a
solution of $\Pi'$ generated by MA-Plan. If such a solution exists, it is found, since
MA-Plan is assumed to be complete. □

The  principle  of  the  completeness  proof  follows  the  idea  of  the  back-on-track  proof 
and  in  addition  it  requires  fewer  assumptions  than  the  lazy  repair  which  is  complete 
only  for  dead-end  free  planning  problems. 

4  Complexity  Analysis 

In this section, the presented plan repair algorithm is studied theoretically from the
perspective of one classical complexity metric and one additional metric suitable for
distributed algorithms. The classically studied metric is time complexity. Additionally,
in multiagent systems, one can use a metric based on the asymptotic ratio of the
communication volume required for an algorithm to finish to the size of the input,
analogously to time complexity.

4.1  Time  Complexity  of  MA-Plan 

The generalized plan repair algorithm presented in the previous section uses a
multiagent planner as a component; therefore the planner's complexity is a key part of the
further analysis. The time complexity of multiagent planning based on solving a
coordination Constraint Satisfaction Problem (CSP) and internal heuristic search (and therefore
of the MA-Plan implementation of the planner presented in [13] as Planning First) was
studied in [2]; the analysis of the time complexity from [2] is recalled in
the following paragraphs.

Informally, the complexity is “the number of times [needed] to verify that a certain
choice of coordination-sequence length forms a basis for a solution × the complexity of
the verification process”. To formally describe the time complexity of the MA-Plan
approach, [2] first defines the size of the CSP domain of each agent's CSP variable. Each
value of each domain represents one possible coordination sequence for one agent.
Such sequences consist of at most $\delta$ coordination points defined as pairs $(a, t)$ with a
public action $a$ and $1 \le t \le n\delta$ for $n$ agents.

The idea of the $n\delta$ limit for the virtual time points $t$ can be demonstrated on an
example with $n = 2$ agents $A_i, A_j$, $i \neq j$, and $\delta = 3$ with precisely three used
coordination points for both agents:

$$
\begin{array}{rcccccc}
A_i: & a_1 & \cdot & a_2 & \cdot & a_3 & \cdot \\
A_j: & \cdot & a_1 & \cdot & a_2 & \cdot & a_3
\end{array}
\tag{6}
$$

The example shows the longest possible coordination pattern for that particular
instance, as both agents use all possible coordination points and the actions depend
on each other such that no prolonging of the pattern is possible. In that case, the first
coordination point is $(a_1, 1)$ and the last one $(a_3, n\delta)$, where $n\delta = 6$.

The size of the CSP domain as defined in [2] is, for an agent $A_i$,

$$
|D_i| = \sum_{d=1}^{\delta} \binom{n\delta}{d} \cdot |A_i^{pub}|^d = O\big((n\delta|A_i^{pub}|)^{\delta+1}\big). \tag{7}
$$

The term $\binom{n\delta}{d}$ represents all possible combinations of $d$ virtual time points for the
public actions (e.g., for $d = 2$, $n\delta = 6$ there are 15 of them: $\{(1,2), (1,3), \ldots, (1,6), (2,3), (2,4), \ldots, (5,6)\}$)
and the term $|A_i^{pub}|^d$ represents all possible public action sequences of length $d$ of agent
$A_i$ (e.g., for $d = 2$ and $|A_i^{pub}| = 2$ the sequences are $\{a_1a_1, a_1a_2, a_2a_1, a_2a_2\}$).
Therefore, for each $d$, the complete term in the sum counts the number of possible
coordination sequences with $d$ coordination points. Finally, the sum represents
the number of all possible coordination sequences for one agent.
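
The counting argument above can be checked on the small instance from the text. The following sketch evaluates the domain-size sum and, for comparison, enumerates the coordination sequences explicitly; the function names and the action labels `a1`, `a2` are illustrative assumptions, not part of [2].

```python
from itertools import combinations, product
from math import comb

def domain_size(n: int, delta: int, n_pub: int) -> int:
    """Sum over d = 1..delta of C(n*delta, d) * n_pub**d."""
    return sum(comb(n * delta, d) * n_pub ** d for d in range(1, delta + 1))

def coordination_sequences(n, delta, pub_actions, d):
    """Enumerate all sequences with exactly d coordination points (action, time)."""
    slots = range(1, n * delta + 1)
    return [tuple(zip(acts, times))
            for times in combinations(slots, d)
            for acts in product(pub_actions, repeat=d)]

n, delta = 2, 3
print(comb(n * delta, 2))                             # 15 pairs of time points
print(len(list(product(["a1", "a2"], repeat=2))))     # 4 action sequences
print(domain_size(n, delta, 2))                       # 232 sequences in total
print((n * delta * 2) ** (delta + 1))                 # 20736, the asymptotic bound
```

For this instance the exact count (232) stays well below the bound $(n\delta|A_i^{pub}|)^{\delta+1}$, which is only an upper estimate.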

The domain size is then used in the final time complexity formula for the internal
planning constraints (ipc) in the CSP in the following form:

$$
O\big(f(\mathcal{I}) \cdot n \cdot \max_{A_i \in A} |D_i|\big) = O\big(f(\mathcal{I}) \cdot n(n\delta|A^{pub}|)^{\delta+1}\big) = O_{ipc}, \tag{8}
$$

where the term $f(\mathcal{I})$ represents the maximal complexity of individual planning $\mathcal{I}$ with
a function $f$ describing the cost of switching from regular planning and
$A^{pub} = \bigcup_{i=1}^{n} A_i^{pub}$.

The complexity induced by the coordination constraints (cc) is in [2] derived from the
time complexity of the Adaptive-Tree-Consistency algorithm (ATC) for solving CSP
problems. The complexity is based on the tree-width $\omega$ of the CSP constraint graph [5]
and is

$$
O\big(n \cdot \max_{A_i \in A} |D_i|^{\omega+1}\big) = O\big(n(n\delta|A^{pub}|)^{\delta\omega+\epsilon}\big) = O_{cc}, \tag{9}
$$

where $\epsilon = \delta + \omega + 1$ is dominated by $\delta\omega$. According to [2], the constraint graph is
isomorphic to the moral graph of the agent interaction graph; therefore $\omega$ can be treated
as the tree-width of the agent interaction graph. An agent interaction graph describes
dependencies of the agents on each other defined by public actions.
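
As a reading aid, the exponent in Eq. 9 can be obtained by substituting the domain-size bound into the ATC bound. The step below is a sketch of that substitution, with $q = |A^{pub}|$ used as shorthand:

```latex
\begin{align*}
\max_{A_i \in A} |D_i|^{\omega+1}
  &= O\!\left(\left((n\delta q)^{\delta+1}\right)^{\omega+1}\right)
   = O\!\left((n\delta q)^{(\delta+1)(\omega+1)}\right) \\
  &= O\!\left((n\delta q)^{\delta\omega+\delta+\omega+1}\right)
   = O\!\left((n\delta q)^{\delta\omega+\epsilon}\right),
  \qquad \epsilon = \delta+\omega+1 .
\end{align*}
```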

The final complexity is a sum of the complexities from Eq. 8 and Eq. 9 for the
particular constraints:

$$
O_{ipc} + O_{cc} = O\big(f(\mathcal{I}) \cdot n(n\delta|A^{pub}|)^{\delta+1} + n(n\delta|A^{pub}|)^{\delta\omega+\epsilon}\big) = O_{MAP}. \tag{10}
$$

The complexity has no direct exponential dependence on the number of agents $n$, has
no direct exponential dependence on the length of the individual plans of the agents, and
has no direct exponential dependence on the size of the original planning problem $|\Pi|$.
However, the complexity of the individual planning $f(\mathcal{I})$ is in general still exponential
in the size of the individual planning problems.

4.2  Time  Complexity  of  the  Generalized  Repair  Algorithm 

Let $\Pi = (P, A, I, G)$ be the original multiagent planning problem and $\Pi' = (P, A, F, G)$
be the related multiagent replanning problem. The time complexity of the planning
problem $\Pi$ is the complexity of the MA-Plan algorithm as shown in the previous section,

$$
O\big(f(\mathcal{I}) \cdot n(n\delta q)^{\delta+1} + n(n\delta q)^{\delta\omega+\epsilon}\big), \tag{11}
$$

where for brevity $q = |A^{pub}|$. Straightforwardly, the replanning complexity is in
general the same as that of planning, since the only difference is another initial state $F$. That
means $n$, $q$ and $\omega$ are the same². The length $\delta$ depends on the particular initial state;
however, there is no guarantee that $\delta$ for $F$ will be generally higher or lower than $\delta$
for $I$. From [12], it is known that plan reuse cannot be generally less complex than
replanning from scratch.

Lemma 10. (time complexity). Let $\Sigma = (\Pi, \pi, F, k)$ be a multiagent plan repair
problem and let $\Pi = (P, A, I, G)$ be the related multiagent planning problem. Let $U$
and $V$ be the index vectors of generalized plan repair by Definition 7.

The asymptotic time complexity of Generalized-Repair($\Sigma$, $U$, $V$, MA-Plan) is

$$
O_{GEN} = O\big(f(\mathcal{I}) \cdot l^2 n(n\delta q)^{\delta+1} + l^2 n(n\delta q)^{\delta\omega+\epsilon} + 2l^3\varrho^2 + 2l\big), \tag{12}
$$

where $\varrho = |\Pi|$, $l = |\pi|$, $q = |A^{pub}|$, and the rest of the symbols follow the definitions
for Eq. 10 from [2].

Proof. The Generalized-Repair algorithm is parametrized by two index vectors $U$
and $V$, which are used in a repeated search for a solution of the multiagent plan repair
problem. The time complexity is, informally, how many times the algorithm needs
to generate and test a repair strategy × the complexity of the generate-and-test
procedure.

The generate-and-test procedure consists of two proposition propagation procedures
(by Definitions 4 and 5), each of which in the worst case simulates the execution of a
plan segment. Such a simulation requires $l$ tests of all $|A|$ actions against their at most
$|P|$ possible preconditions, therefore $O(l|A||P|) = O(l\varrho^2)$, since the number of
actions and propositions cannot be bigger than the size of the whole planning problem.
Technically, the extraction of the prefix fragment $\pi_{pre}$ is part of the proposition
propagation process $F \oplus \pi_{pre}$; therefore there is one $O(l\varrho^2)$ term. Another $O(l\varrho^2)$
term is needed for the proposition back-propagation $G \ominus \pi_{suf}$, similarly related to the
extraction of the suffix $\pi_{suf}$. The $\pi_{fix}$ fragment requires solving one multiagent
planning problem, hence the time complexity of one generate-and-test procedure is

$$
O_{GT} = O_{MAP} + O(2l\varrho^2) = O\big(f(\mathcal{I}) \cdot n(n\delta q)^{\delta+1} + n(n\delta q)^{\delta\omega+\epsilon} + 2l\varrho^2\big). \tag{13}
$$

² The difference in sizes of different states cannot be larger than $|P|$, which bounds it by a constant.

The procedure with complexity $O_{GT}$ is used in Generalized-Repair at most
$|U| \cdot |V|$ times. If a solution is found, two additional concatenations are needed
to finally build the solution, therefore

$$
\begin{aligned}
& O(|U| \cdot |V|) \cdot O_{GT} + O(|\pi_{pre}|) + O(|\pi_{suf}|) = \\
& O\big(l^2 \cdot f(\mathcal{I}) \cdot n(n\delta q)^{\delta+1} + l^2 \cdot n(n\delta q)^{\delta\omega+\epsilon} + 2l^3\varrho^2 + |\pi_{pre}| + |\pi_{suf}|\big) = \\
& O\big(f(\mathcal{I}) \cdot l^2 n(n\delta q)^{\delta+1} + l^2 n(n\delta q)^{\delta\omega+\epsilon} + 2l^3\varrho^2 + 2l\big) = O_{GEN},
\end{aligned}
\tag{14}
$$

where $|U| \cdot |V| = O(|\pi|^2) = O(l^2)$, as the index vectors can parametrize at most
all combinations of the indices to the original plan $\pi$ by Definition 7. The lengths of
the resulting plan segments $|\pi_{pre}|$ and $|\pi_{suf}|$ are bounded by the length of the plan
$l = |\pi|$. □

The resulting time complexity of the generalized repair algorithm does not comprise
any extra exponential dependency in any of the additional terms, which are always
polynomial provided that $l$ is polynomial w.r.t. $\varrho$. Consistently with [12], the
asymptotic worst-case complexity is also never reduced, which is an anticipated result of the
analysis. The idea of the presented plan repair technique is in general to lower $\delta$ by
simplifying the inner planning process with the help of reusing parts of the original plan.
Since $\delta$ appears in $O_{MAP}$ in two exponents, this idea is supported by the
analysis as well.

The results of the time complexity analysis of the repair algorithms therefore prove
the time complexity part of Hypothesis 1 with an additional assumption
on the polynomial length of the repaired plan. Additionally, they are not in conflict
with the remaining three hypotheses. Hypothesis 2 targets taking only a subset of $A$,
which can in effect lower the tree-width $\omega$ if the remaining agents are less coupled. In
Hypothesis 3, the length $\delta$ of the inner repair plan is targeted: in problems
with actions having long dependency trees, it is theorized that fixing the problem sooner
will require a smaller $\delta$ than solving it later, possibly with a longer reverting plan with a bigger
$\delta$. In the last Hypothesis 4, a smaller $\delta$ should be achieved by the shortest possible repair plans,
where no reverting is caused by overusing the original plan ($u \in U$, $v \in V$, $u + v > l - k$)
and no unnecessarily long repair plans are needed provided that $u + v \le l - k$.

4.3  Communication  Complexity  of  MA-Plan 

To study the communication complexity of the presented multiagent plan repair
algorithm, the communication complexity of the inner planning process has to
be known first. Since the work of Brafman and Domshlak [2] formally tackles only time
complexity, we propose an analysis of communication in a similar manner, based on an
analysis of the Adaptive-Tree-Consistency (ATC) algorithm.

The communication complexity of ATC can be derived from its space complexity,
which was studied in [5]. The analysis uses the Big-O notation similarly to the
previous section. To distinguish time and communication complexity, the Big-O
carries a superscript $c$ for communication complexity, $O^c$.

The size of each message in ATC is $\max|D_i|^{sep}$, where $sep$ is the size of a maximal
separator in the tree decomposition of the CSP. In bucket-trees [5], the size of the separator is
the tree-width $\omega$; therefore the worst-case size of one message is $\max|D_i|^{\omega}$. The maximal
number of messages communicated in a bucket-tree CSP solver is double the number
of arcs (two messages for each arc in a tree of an $n$-vertex graph), therefore $2(n - 1)$, where
$n$ is the number of buckets. The buckets represent the CSP variables (with the
constraints in them resembling the principle of tree-decomposition); therefore the number
of buckets is the number of agents in the coordination constraint (cc). Since the
internal planning constraints (ipc) are represented as unary constraints, they do not
require any communication. Altogether, this gives the communication complexity of
MA-STRIPS planning as


$$
2(n - 1) \cdot \max_{A_i \in A} |D_i|^{\omega} = O^c\big(e\,n(n\delta q)^{\delta\omega+\epsilon} + e'\big) = O^c\big(n(n\delta q)^{\delta\omega+\epsilon}\big) = O^c_{MAP}, \tag{15}
$$

where $\epsilon$ is dominated by $\delta\omega$ in the exponent, $e = 2$ is a polynomial coefficient and $e'$
is dominated by the first polynomial term. The communication complexity of planning
using CSP for coordination is therefore not exponentially dependent on the number
of agents $n$, it is not dependent on the complexity of the individual planning $\mathcal{I}$, and it
has no direct exponential dependence on the size of the original planning problem $|\Pi|$,
similarly to the time complexity. The communication complexity is bounded by one
exponential term in the number of coordination points and the tree-width of the agent
interaction graph,

$$
\exp(\delta\omega). \tag{16}
$$

4.4  Communication Complexity of the Generalized Repair Algorithm

The communication complexity of generalized repair is derived by the same process
as in the case of the time complexity. It builds on the derived complexity of the
inner planning of MA-Plan, which is in the case of communication $O^c(n(n\delta q)^{\delta\omega+\epsilon})$ as
shown in Eq. 15.

As with the time complexity, replanning is in general the same as planning from
the perspective of communication complexity. The only difference is in the initial state.
That means $n$, $q$ and $\omega$ are the same and $\delta$ depends on the particular initial state without
any general guarantees on its change during replanning.

Lemma 11. (communication complexity). Let $\Sigma = (\Pi, \pi, F, k)$ be a multiagent plan
repair problem and let $\Pi = (P, A, I, G)$ be the related multiagent planning problem.
Let $U$ and $V$ be the index vectors of generalized plan repair by Definition 7.

The asymptotic communication complexity of Generalized-Repair($\Sigma$, $U$, $V$, MA-Plan) is

$$
O^c_{GEN} = O^c\big(l^2 n(n\delta q)^{\delta\omega+\epsilon} + 2nl^3\varrho + n^2 + n\big), \tag{17}
$$

where $\varrho = |\Pi|$, $l = |\pi|$, $q = |A^{pub}|$, and the rest of the symbols follow the definitions
for Eq. 10 from [2].

Proof. In Generalized-Repair, the process is separable into repeated sub-processes
of generating and testing a repair strategy, and it needs an additional synchronization
broadcast, as the agents have to be made aware of a new plan repair process in case of a failure in
a private fact. Such a broadcast is an additive factor $O^c_{sync} = O(n)$, as the messages are
sent to all agents. The communication complexity is first derived for one such
sub-process. The two proposition propagation procedures $F \oplus \pi_{pre}$ and $G \ominus \pi_{suf}$ each
in the worst case use the same principle as a distributed simulation of the execution of a
multiagent plan segment, $O(nl|P|) = O(nl\varrho)$, and are used with one inner planning,
therefore

$$
O^c_{GT} = O^c\big(n(n\delta q)^{\delta\omega+\epsilon} + 2nl\varrho\big). \tag{18}
$$

As with the time complexity, the sub-process with the complexity $O^c_{GT}$ can be used
in the worst case $|U| \cdot |V|$ times. If a sound solution is found, the agents have to inform each
other; therefore there is, besides the initial synchronization, a termination
synchronization of $O^c(n^2)$, although the local plans do not have to be communicated, as they are later
executed by the particular agents. The communication complexity of
Generalized-Repair is

$$
\begin{aligned}
& O^c_{sync} + O^c(|U| \cdot |V|) \cdot O^c_{GT} + O^c(n^2) = \\
& O^c\big(n + l^2 n(n\delta q)^{\delta\omega+\epsilon} + 2nl^3\varrho + n^2\big) = \\
& O^c\big(l^2 n(n\delta q)^{\delta\omega+\epsilon} + 2nl^3\varrho + n^2 + n\big) = O^c_{GEN},
\end{aligned}
\tag{19}
$$

where $|U| \cdot |V| = O(|\pi|^2) = O(l^2)$, as the index vectors can parametrize at most all
combinations of indices to the original plan $\pi$ by Definition 7. □

The analyzed communication complexity does not bring any new terms
exponentially dependent on any of the parameters; the additional terms are always polynomial provided that
$l$ is polynomial w.r.t. $\varrho$. Therefore the communication complexity of the proposed
plan repair algorithm remains exponential only in the product of the number of coordination
points $\delta$ in the inner repair plan and the tree-width $\omega$ of the agent interaction graph, i.e.,
$\exp(\delta\omega)$ as in Eq. 16. This result is anticipated, as the communication complexity is
usually proportional to the time complexity.

The resulting communication complexity of the generalized repair algorithm proves
the communication, and final, part of Hypothesis 1 with an additional assumption on the
polynomial length of the repaired plan. Additionally, the results support the
experimental results from [10] and concur with the remaining hypotheses, similarly to the
case of the time complexity. The core hypothesis of [10] states that the communication
overhead is lowered by plan repair producing more preserving repairs in comparison to
replanning. Since the communication complexity of replanning is exponentially
dependent on $\delta$, this hypothesis is supported by the analysis as long as at least one coordination
point is spared, because decreasing the exponential factor by one, $\exp((\delta - 1)\omega)$,
dominates any additional polynomial factors added by the plan repair techniques. This is
true only if the problems are tightly coordinated, $\omega \gg 0$. In the contrary case, the
exponential factor is negligible even if $\delta$ is not decreased by the preservation of the repair,
formally $\exp(\delta\omega) \to 1$ iff $\delta \to 0$ or $\omega \to 0$.
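
The trade-off described above can be illustrated by plugging small numbers into the bound expressions. This is only a numeric sketch: it evaluates worst-case expressions, not measured communication, and it assumes that every tested repair strategy needs one coordination point fewer than replanning. The function names are illustrative, `rho` stands for the problem size $|\Pi|$, and `pairs` for the number of tested prefix and suffix combinations.

```python
from math import comb

def domain_size(n: int, delta: int, q: int) -> int:
    """Number of coordination sequences of one agent (sum over d = 1..delta)."""
    return sum(comb(n * delta, d) * q ** d for d in range(1, delta + 1))

def replanning_volume(n: int, delta: int, omega: int, q: int) -> int:
    """2(n - 1) messages, each of worst-case size |D|**omega."""
    return 2 * (n - 1) * domain_size(n, delta, q) ** omega

def repair_volume(n, delta, omega, q, l, rho, pairs):
    """Initial sync + pairs * (inner planning + two propagations) + termination."""
    per_attempt = replanning_volume(n, delta, omega, q) + 2 * n * l * rho
    return n + pairs * per_attempt + n ** 2

n, q, l, rho = 3, 2, 10, 50
for omega in (0, 2):
    replan = replanning_volume(n, delta=4, omega=omega, q=q)
    repair = repair_volume(n, delta=3, omega=omega, q=q, l=l, rho=rho, pairs=l * l)
    print(omega, replan, repair, repair < replan)
# 0 4 300412 False
# 2 397444096 278522412 True
```

With a tree-width of 2, sparing a single coordination point outweighs even the full $l^2$ repetition factor in this instance, whereas for a tree-width of 0 the polynomial overhead of the repair dominates.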

The arguments on decreasing the amount of communication for the remaining
three hypotheses of this article copy those in the time complexity analysis.
Hypothesis 2 targets taking only a subset of $A$, which can in effect lower the tree-width $\omega$ if the

Algorithm 2 Plan execution and monitoring scheme.

Input: An initial multiagent planning problem $\Pi = (P, A, I, G)$, vectors $U$ and $V$,
and the Planning First multiagent planner by [13] as MA-Plan.

 1: $\pi$ = MA-Plan($\Pi$)
 2: if MA-Plan failed then return fail
 3: $k = 1$
 4:
 5: repeat
 6:     agents perform $\pi[k]$
 7:     if failure detected then
 8:         retrieve the current state $F$ from the environment
 9:         $\pi$ = Generalized-Repair($(\Pi, \pi, F, k)$, $U$, $V$, MA-Plan)
10:         $k = 1$
11:     else
12:         $k = k + 1$
13:     end if
14: until Generalized-Repair failed or $k > |\pi|$


remaining agents are less coupled, and therefore lower the communication complexity.
In Hypothesis 3, the length $\delta$ of the inner repair plan should be minimized if failures
of actions with long dependency trees are fixed as soon as possible. In Hypothesis 4,
a smaller $\delta$ should be achieved by the shortest possible repair plans through appropriate reuse of
the original plan.
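
The execution and monitoring scheme of Algorithm 2 can be sketched as a small loop with pluggable callbacks. The sketch below is illustrative only: `perform`, `observe` and `repair` are hypothetical callbacks, the index `k` is 0-based (the algorithm above counts from 1), and `ToyWorld` is an invented environment whose repair simply replans.

```python
from typing import Callable, List, Optional

def execute_with_repair(initial_plan: Optional[List[int]],
                        perform: Callable[[int], bool],
                        observe: Callable[[], int],
                        repair: Callable[[List[int], int, int], Optional[List[int]]]) -> bool:
    """Run the plan step by step; on a failure, repair it and restart from its first step."""
    plan = initial_plan
    if plan is None:                 # the initial planning failed
        return False
    k = 0
    while k < len(plan):
        if perform(plan[k]):         # step executed without failure
            k += 1
        else:
            state = observe()        # retrieve the current state F
            plan = repair(plan, state, k)
            if plan is None:         # the repair failed
                return False
            k = 0
    return True

class ToyWorld:
    """Hypothetical world: a counter that must reach a target; failures are scripted."""
    def __init__(self, target: int, failing_attempts):
        self.pos, self.target = 0, target
        self.failing, self.attempt = set(failing_attempts), 0

    def perform(self, step: int) -> bool:
        self.attempt += 1
        if self.attempt in self.failing:
            return False             # the action is not executed
        self.pos += step
        return True

    def observe(self) -> int:
        return self.pos

    def repair(self, plan, state, k):
        return [1] * (self.target - state)   # stand-in: replan from the observed state

world = ToyWorld(target=3, failing_attempts={2})
ok = execute_with_repair([1, 1, 1], world.perform, world.observe, world.repair)
print(ok, world.pos, world.attempt)  # True 3 4
```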


5  Experimental  Analysis 

The following sections of the article focus on the remaining hypotheses and their
experimental analysis.

The  experiments  were  conducted  in  a  synthetic  setting,  a  simulated  world  with  a 
group  of  agents  using  a  plan  execution,  monitoring  and  repair  loop  (see  Algorithm  2). 
The  world  was  modeled  as  fully  observable.  All  failures  of  plan  execution  were  gen¬ 
erated  by  the  simulator  according  to  a  uniform  distribution  over  time  and  parametrized 
by  a  probability  p  of  failure  occurrence  in  each  step  for  each  experiment.  The  failures 
were  handled  by  the  agents  immediately  upon  detection. 

A failure was simulated by the non-execution of some of the agent actions from the
current plan step. The particular actions were chosen according to a uniform probability
distribution over the individual actions within a joint action. As shown by [10],
failure models with more radical impacts on the environment (e.g., state perturbations)
decrease the usability of the plan repairing approaches. Our motivation in this work is to
study types of plan repairing strategies; therefore we restrict ourselves to action failures.
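
A failure model of this kind could be sketched as follows. The sketch makes assumptions the text does not fix: a joint action is represented as a mapping from agent to action, a failure occurs in a step with probability `p`, and a failure drops `n_failed` individual actions (one by default) chosen uniformly at random. The function name and parameters are illustrative.

```python
import random
from typing import Dict, Tuple

def simulate_step(joint_action: Dict[str, str], p: float, rng: random.Random,
                  n_failed: int = 1) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a joint action {agent: action} into its executed and failed parts."""
    if not joint_action or rng.random() >= p:
        return dict(joint_action), {}            # no failure in this step
    # Uniform choice over the individual actions of the joint action.
    victims = set(rng.sample(sorted(joint_action), min(n_failed, len(joint_action))))
    executed = {ag: a for ag, a in joint_action.items() if ag not in victims}
    failed = {ag: a for ag, a in joint_action.items() if ag in victims}
    return executed, failed

rng = random.Random(0)
joint = {"a1": "load", "a2": "drive", "a3": "fly"}
executed, failed = simulate_step(joint, p=1.0, rng=rng)
print(len(executed), len(failed))  # 2 1
```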

For the implementation of the experimental setup and the repairing algorithms,
we used a centralized world simulator integrating the multiagent domain-independent
planner Planning First [13], denoted as MA-Plan. Each agent ran in its own thread

Figure 4: Relation between communicated bytes and computation time for solving the
plan repairing problems.


and deliberated asynchronously. The experiments were executed on an 8-core processor
at 3.6 GHz with the Java Virtual Machine limited to 2.5 GB of RAM.

For the experiments, we used four planning domains. Three of them originate in
the standard single-agent IPC planning benchmarks. Similarly to the evaluation of the
MA-Plan algorithm in [13], we chose domains that are straightforwardly modifiable
to a multiagent setting: LOGISTICS, ROVERS, and SATELLITES. Additionally, we
extended the set of benchmarks by the COOPERATIVE PATHFINDING coordination domain
on a grid [10].

The experimental measurements were based on two metrics focusing on the target
efficiencies. The first was the cumulative time consumed by the particular plan repairing
algorithm during a single run of the simulation, i.e., the overall time spent in the algorithm
(incl. the underlying planning process), excluding the initial planning phase of the scheme
(Algorithm 2). The second metric was the communication complexity of the process, that is,
the volume of information in bytes communicated among the involved agents during
the plan repairing processes. Those are mainly the messages generated by the DisCSP
solver of the Planning First MA-Plan planner and additional synchronization
processes minimizing the number of agents involved in the plan repairing process.

To account for differences in the essential computational and communication
complexity of the domains, we conducted an experiment on the relationship between these two
measures. Figure 4 depicts the results and demonstrates that there is no essential
discrepancy between the computational and communication complexity of the plan repairing
solutions. That means the following results are not biased by problems that are extremely hard
in time but simple in communication, or vice versa.

5.1  The  Number  of  Repairing  Agents

Regardless of the theoretical results presented in [2], showing that the computational
complexity of CSP-based multiagent planning is not exponentially dependent on the
number of agents, in practical experiments we faced a non-negligible dependence
of the required communication and computational effort on this number. The first set of
experiments analyzes this relation by means of Hypothesis 2.

5.1.1  Used  Plan  Repair 

To validate Hypothesis 2, we prepared an extensive set of plan repairing strategies
stemming from the generalized repair. They can be divided into three main groups:
one without agent count minimization, and two with agent minimization. The first of the
minimization groups reuses the original plan purely as a suffix and the other one purely
as a prefix.

The differences between the strategies within one group of repair strategies lie
in the preference between agent minimization, the size of the preserved part of the original plan,
and the bound on the maximal length of the newly generated repairing plan component
$\pi_{fix}$. This approach restrains the bias possibly caused by unbalanced influences of the agent
minimization on various types of plan repair strategies.

The approach minimizing the number of involved agents was based on the notion
of a set of supporting agents. The iterative process from Algorithm 3 was extended
with an iteration starting only with a set of agents providing at least one action that
can contribute to the repairing plan by a required proposition, i.e., support part of
$G \ominus \pi_{suf}$. If such a team of agents is not able to solve the plan repairing problem, the
team is extended by additional agents supporting any of the current agents in the team
by contributing to propositions in their preconditions. If no such additional agent
exists and the team still does not contain all the agents from $A$, a random agent
is added to the team and the process continues.
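
The team-growing iteration described above can be sketched as follows. This is a simplified illustration under assumptions: each agent is given as a list of (preconditions, add effects) pairs, `required` stands for the back-propagated goal propositions, and `try_repair` is a hypothetical callback that runs the repair for a given team and returns a plan or `None`. All names and the example agents are illustrative.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Facts = FrozenSet[str]
AgentActions = List[Tuple[Facts, Facts]]   # (preconditions, add effects)

def supporters(agents: Dict[str, AgentActions], needed: Facts, exclude: Set[str]) -> Set[str]:
    """Agents outside `exclude` with at least one action adding a needed proposition."""
    return {ag for ag, acts in agents.items()
            if ag not in exclude and any(add & needed for _, add in acts)}

def repair_with_minimal_team(agents: Dict[str, AgentActions], required: Facts,
                             try_repair: Callable[[Set[str]], Optional[list]],
                             rng: random.Random):
    """Grow the team until the repair succeeds or all agents are involved."""
    team = supporters(agents, required, set())
    while True:
        if team:
            plan = try_repair(team)
            if plan is not None:
                return team, plan
        if len(team) == len(agents):
            return team, None                     # even the full team failed
        # Propositions needed by the preconditions of the current team members.
        needed = frozenset().union(*(pre for ag in team for pre, _ in agents[ag]))
        extra = supporters(agents, needed, team)
        if not extra:                             # no supporter left: add a random agent
            extra = {rng.choice(sorted(set(agents) - team))}
        team = team | extra

agents = {
    "truck": [(frozenset({"pkg_at_depot"}), frozenset({"pkg_at_goal"}))],
    "plane": [(frozenset({"pkg_at_airport"}), frozenset({"pkg_at_depot"}))],
    "rover": [(frozenset({"rock_seen"}), frozenset({"sample_taken"}))],
}
solvable = lambda team: ["plan"] if {"truck", "plane"} <= team else None
team, plan = repair_with_minimal_team(agents, frozenset({"pkg_at_goal"}),
                                      solvable, random.Random(0))
print(sorted(team))  # ['plane', 'truck']
```

In the example the unrelated `rover` agent never joins the repair, which is the effect the minimization aims for.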

5.1.2  Results  and  Discussion 

The experiments were conducted in all presented planning domains and for all
agent counts, i.e., two to four agents, giving twelve domain and problem
instances. Each of the three groups contained six variants of the repairing strategies, giving
216 experiments in total. Each experiment was averaged over 5 measurements
with different random seeds.

Figure 5 shows the results of the first batch of experiments. The first group of repairing
strategies, not minimizing the number of involved agents (red color), is in most
measurements worse than the baseline replanning strategy in both the computational and the
communication metric. The suffix-preserving algorithms minimizing the number of agents
(green color) are, on the other hand, better than the baseline strategy in both metrics in
nearly all measurements, with an exception in the simplest COOPERATIVE PATHFINDING
problems. The group of plan repairing strategies minimizing the number of involved
agents and preserving the prefix part of the original plan (blue color) ties with or
outperforms replanning in rather loosely coupled domains. The communication and
computational overheads decrease with decreasing coupling of the domains. However, in more tightly
coupled domains, the strategies fall behind the replanning baseline. In the LOGISTICS
domain, only 33% of the strategies are better in terms of communication overhead and only 18%
in terms of computational overhead. With increasing coupling, the approach loses
more. These results support the second hypothesis.

Additionally, the results revealed that although the prefix preserving approaches are
not the best group among the agent-minimizing approaches overall, in most of the
experiments at least one of them outperforms the best suffix preserving approach. In the
LOGISTICS domain, the separation between the best prefix and the best suffix preserving
plan repairing strategy is about half an order of magnitude in favor of the prefix
preserving approach. On the other hand, in COOPERATIVE PATHFINDING, suffix preserving
approaches gain more than one order of magnitude.

5.2  Repairing  of  Actions  with  High  Dependency  Metrics 

The intuition behind Hypothesis 3 can be rephrased as follows: if an action fails and
it potentially has many future dependencies, possibly involving other agents or even the
goal, trying to fix it as soon as possible is a better idea than ignoring it and trying
to repair it later. The experiments in this section were conducted to validate this concept.

5.2.1  Used  Plan  Repair 

The most straightforward approach here is to compare the two plan repairing strategies
re-using the whole original plan either as a prefix or as a suffix. These strategies are
again parameterizations of the generalized repair such that there is no iteration
over various $u$ and $v$, but only two fixed values. The pure prefix strategy uses the
fixation $u = |\pi| - k$, $v = 0$, and the pure suffix strategy uses the fixation
$u = 0$, $v = |\pi| - k$.

In order to explain the result, we have to present more details on the LOGISTICS
domain. In the LOGISTICS problem with three agents used in the experiments, the agents
control two trucks $T_1$ and $T_2$ and one airplane $A$. There are two cities, each with
one storage depot ($d_1$ and $d_2$) and one airport ($a_1$ and $a_2$). The trucks can
move, m(from, to), only within their cities, i.e., between one depot and one airport.
The airplane can fly, f(from, to), among all airports in the environment, but cannot
land at the depots. All vehicles can load, l(package, location), and unload,
u(package, location), a package at a location. Initially, there is one package $p$ at
one of the depots and the goal is to transport it to the other depot in the other city.
The trucks start at the depots and the airplane starts at one of the airports. A typical
multiagent plan solving this particular instance is depicted in the matrix form in
Figure 6.

5.2.2  Results  and  Discussion 

To validate Hypothesis 3, we ran the pure prefix and pure suffix preserving repairing
strategies in all the testing domains. We measured the ratio of successful repairs of
these two repairing strategies against replanning in terms of computation time.
Figure 7 summarizes the results of these experiments.

In the ROVERS and SATELLITE domains, the plans solving the problem do not contain
any actions that are significant in terms of the number of future dependencies relative
to the overall
number of agents involved and preserving suffix (back-on-track) of the original plan and the blue group contains strategies also minimizing
number of agents and preserving prefix (lazy repair) of the original plan.


A  :  ε  ε  ε  l(p,a1)  f(a1,a2)  u(p,a2)
T1 :  l(p,d1)  m(d1,a1)  u(p,a1)  ε  ε  ε
T2 :  [...]  u(p,d2)
Figure 6: A multiagent plan solving the initial LOGISTICS problem used in the
experiments. Empty actions are denoted as ε. The overlines mark public actions. The
numbers in the last row represent particular counts of steps, i.e., the number of actions
$|\pi| - k$, to the end of the plan.
Figure 7: Comparison of success ratio against replanning between suffix preserving
(green, back-on-track) and prefix preserving (blue, lazy) plan repairing with variable
length $m = |\pi| - k$ of the repaired plan segment.
count of actions in the plan. In SATELLITES, all actions are private and therefore actions
of one agent depend only on other actions of the same agent. The highest dependency
metric is in such a case $\max_{a \in \bigcup_{i=1}^{n} A_i} \mathrm{dep}_{\pi}(a) = M$
for the first action $a$ of each satellite in the multiagent plan $\pi$. The individual
plans of the agents are relatively short (three to four actions) and therefore the
dependency metric is never higher than four, and it is zero for the public projection
$\pi_{\mathrm{pub}}$ of the plan $\pi$.

Multiagent plans for the ROVERS problems contain several public actions at the end
of the plan, always representing only one rover communicating at one time point.
Although the plans solving the ROVERS problems contain public actions, there are again
no long dependencies among the actions. The dependencies in the private part of the
plan form three components, each containing three to four private actions.
Consequently, the private dependencies are, similarly to the SATELLITE problems, at most
four actions long. The dependencies among the public actions are even shorter, as there
is the same number of public actions as agents, which means at most three-action
public dependencies for three agents. The dependency link between one public action
and one dependent private component increases the maximal dependent length
to at most seven actions (four private actions of the component bound to three
public actions successively dependent on each other). In terms of the dependency metrics,
$\max_{a \in \bigcup_{i=1}^{n} A_i} \mathrm{dep}_{\pi}(a) = 4 + n$ and
$\max_{a \in \bigcup_{i=1}^{n} A_i} \mathrm{dep}_{\pi_{\mathrm{pub}}}(a) = n$ for the
public plan and for $n$ rovers.

In such a repair problem, even if one of the leading actions in a private component
fails, the prefix preserving (i.e., lazy) approach solves nearly the complete problem only
by reusing the original plan. More precisely, it reuses the original solution for the rest
of the private components and all the public actions except the one of the failed agent.
As the results show, the prefix-based repair is always better than the suffix-based one
and the ratio between these two is rather stable over different points in the plan.

The situation changes in the LOGISTICS domain. In LOGISTICS with three agents
and one package, there is a chain of dependent actions. In particular, $u(p, d_2)$ depends
on $l(p, a_2)$, which depends on $u(p, a_2)$, and so on back to the first action of the
plan, $l(p, d_1)$. The dependency chain has six public actions in the example plan and
occupies its complete length. As the results in Figure 7 show, there are two distinctive
peaks where the suffix preserving repair outperforms the prefix preserving repair,
additionally with an increasing trend. The first one is for repair plans of length
$m = 3$ and the other one is for $m = 6$. As presented in Figure 6, these lengths
correspond to the package handover points in the plan, more precisely, to the repair of
the failing unloads $u(p, a_1)$ and $u(p, a_2)$.

For these public actions, the dependency metric
$\max_{a \in \bigcup_{i=1}^{n} A_i} \mathrm{dep}_{\pi_{\mathrm{pub}}}(a)$, in contrast to
ROVERS, depends not only on the number of the agents, but also on the length of the
public plan. Ignoring a failure of unloading, as the prefix preserving (i.e., lazy)
approach does, causes the package to be left in the last vehicle while the rest of the
team finishes the executable remainder of the plan, which in principle means the
vehicles are moving, but they are not transporting the package. On the other hand, in the
same circumstances, the suffix preserving repair (i.e., back-on-track) only repeats the
unload action and successfully continues with the rest of the original plan, ending in a
goal state.
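
The contrast between the two strategies for a failed unload can be reproduced in a small
simulation. The sketch below is illustrative only and rests on several assumptions that
are not taken from the experiments: the joint plan is linearized into a single sequence,
truck T2 is assumed to wait at airport a2, a failed action is modeled as having no
effect, and the lazy strategy is modeled as skipping inapplicable actions (the later
replanning step of the lazy repair is not modeled). All fact and function names are
hypothetical.

```python
from typing import FrozenSet, List, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    dele: FrozenSet[str]  # "del" is a reserved word in Python


def load(v: str, loc: str) -> Action:
    return Action(f"l(p,{loc})", frozenset({f"at({v},{loc})", f"at(p,{loc})"}),
                  frozenset({f"in(p,{v})"}), frozenset({f"at(p,{loc})"}))


def unload(v: str, loc: str) -> Action:
    return Action(f"u(p,{loc})", frozenset({f"at({v},{loc})", f"in(p,{v})"}),
                  frozenset({f"at(p,{loc})"}), frozenset({f"in(p,{v})"}))


def move(v: str, src: str, dst: str, op: str = "m") -> Action:
    return Action(f"{op}({src},{dst})", frozenset({f"at({v},{src})"}),
                  frozenset({f"at({v},{dst})"}), frozenset({f"at({v},{src})"}))


def execute(plan: List[Action], state: FrozenSet[str]) -> FrozenSet[str]:
    """Run the plan in order, silently skipping inapplicable actions."""
    for a in plan:
        if a.pre <= state:
            state = (state | a.add) - a.dele
    return state


PLAN = [load("T1", "d1"), move("T1", "d1", "a1"), unload("T1", "a1"),
        load("A", "a1"), move("A", "a1", "a2", op="f"), unload("A", "a2"),
        load("T2", "a2"), move("T2", "a2", "d2"), unload("T2", "d2")]
INIT = frozenset({"at(T1,d1)", "at(A,a1)", "at(T2,a2)", "at(p,d1)"})
GOAL = "at(p,d2)"

FAILED = 5  # index of u(p,a2), the unload performed by the airplane
state_f = execute(PLAN[:FAILED], INIT)               # the failed unload changes nothing
lazy_end = execute(PLAN[FAILED + 1:], state_f)       # ignore the failure, reuse the rest
back_on_track_end = execute(PLAN[FAILED:], state_f)  # repeat the unload, then continue
print(GOAL in lazy_end, GOAL in back_on_track_end)   # prints "False True"
```

Under these assumptions the lazy run ends with the package still in the airplane while
truck T2 has moved to the depot empty, whereas repeating the unload lets the original
suffix reach the goal.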

One can argue that the complementary load actions should be repaired more efficiently
using the same argumentation as well. This is true; however, this phenomenon is
not captured in the results because of the particular implementation of the MA-Plan
planner. The explanation is that the efficiency of the used planner depends more on
small differences in the number of involved agents than on the number of planned
actions. In the case of $m = 3$ (the $u(p, a_2)$ action), two agents are needed for the
lazy repair, because first the executable remainder of the original plan is reused up to
the last state without the package and then the planner has to be used to generate a
repair plan $\pi_{\mathrm{fix}}$ reverting all the moves and planning to one of the goal
states again. Such a plan first has to unload the package from the airplane $A$ and then
transport it by the truck $T_2$ to the goal destination $d_2$. On the other hand, the pure
suffix preserving approach generates only a plan repeating the unload action
$u(p, a_2)$ and afterward continues with the original plan as a suffix. This planning
problem involves only one agent, namely the airplane $A$ carrying out the unload of the
package. The same principle applies to $m = 6$, but with all three agents for the pure
prefix preserving (lazy) repair and only two agents for the pure suffix repair
(back-on-track).

In the last problem, COOPERATIVE PATHFINDING, the length of a sequence of
dependent actions corresponds to the length of the plan as well, as all the actions in
such a plan are public and inter-dependent. Nevertheless, this is quite a different "order
of dependency" than, for example, in SATELLITES. In SATELLITES, all the actions are
dependent as well, but only within one agent, whereas here, the actions are dependent
across the agents. In the experimental results of COOPERATIVE PATHFINDING,
a trend arises: in such dense types of inter-dependent problems, the longer the
repaired plans are, the more the suffix repair algorithm gains against the prefix one.

The results of these experiments, namely of LOGISTICS and COOPERATIVE PATHFINDING,
moderately support the third hypothesis of the paper.

5.3  Partitioning  of  the  Original  Plan 

It is not intuitively clear what a good strategy for reusing parts of the original plan
is, especially in relation to a particular planning domain. The experiments conducted in
this section provide several insights into this issue and focus especially on
Hypothesis 4.

5.3.1  Used  Plan  Repair 

A battery of plan repairing strategies was prepared to validate Hypothesis 4. We
parameterize how much of the original plan the generalized repair reuses. Such a
parameterization (based on the $U$ and $V$ index vectors) leads to a two-dimensional
discrete space of different plan repairing strategies, as depicted in Figure 8,
representing a structure of the repaired plan.

Each of the nine diagrams in the figure describes a variation on a resulting plan
repaired by one particular parameterization of the algorithm in the context of execution
of the original plan. The execution starts with the world in the initial state $I$ and
is anticipated to continue with the help of the original plan to the last state $S_m$,
which is one of the goal states, i.e., $S_m \supseteq G$. However, during execution of an
action following a state $S_k$, the execution failed and the world ends up not in the
state $S_{k+1}$, but
Figure 8: Scheme of a two-dimensional space representing plan repairing strategies
preserving different parts of the original plan and reusing it in different ways. The blue
segments represent prefix re-usage and the green ones the suffix re-usage. The notable
states are: initial state $I$, last achieved state $S_k$ induced by the original plan,
exceptional state $F$ after a failure and the last anticipated state $S_m \supseteq G$,
provided that the original plan would be executed without a failure.


in a state $F$, out of the anticipated sequence of states and actions. To fulfill the
goal, the agents use one of the plan repairing strategies, which, under the condition of
perfect execution, would transform the world from $F$ to a state $S_m \supseteq G$.

In Figure 8, two dimensions are depicted. One of the dimensions represents
the number of actions which have to be reused from the beginning of the original plan as
a prefix, corresponding to the fixation of the iteration parameter $U = (|\pi| - k)$.
The other dimension represents the number of actions reused as a suffix of the final
repairing plan, i.e., the fixed iteration parameter $V = (|\pi| - k)$. In the presented
scheme, $\pi_{\mathrm{pre}}$ from Algorithm 3 is denoted as a blue line,
$\pi_{\mathrm{suf}}$ as a green line and $\pi_{\mathrm{fix}}$ as a thick black
arrow. Since both dimensions reuse the same original plan, the space is always a
square with a side of length $|\pi| - k$.

There are four extremes in the repair strategy space. The strategy at position $(0, 0)$
effectively degenerates from $\pi_{\mathrm{pre}} \cdot \pi_{\mathrm{fix}} \cdot \pi_{\mathrm{suf}}$
to $\pi_{\mathrm{fix}}$. Such a process corresponds to replanning from scratch. The
strategies at positions $(|\pi| - k, 0)$ and $(0, |\pi| - k)$ represent the pure repairs
$\pi_{\mathrm{pre}} \cdot \pi_{\mathrm{fix}}$ and $\pi_{\mathrm{fix}} \cdot \pi_{\mathrm{suf}}$
respectively. The last extreme at $(|\pi| - k, |\pi| - k)$ represents a strategy which
first uses the original plan, ignoring inapplicable actions, then, using a newly
generated plan $\pi_{\mathrm{fix}}$, returns to the anticipated state after execution
of the failed action $S_k$, and then reuses the original plan again to get to the goal
state, i.e., the algorithm generates a full overlap of the prefix and suffix plans.
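
The four corner cases of the strategy space can be captured by a small classifier. This
is a minimal sketch for illustration; the function name `classify_strategy` and the
returned labels are assumptions, with $u$ the number of reused prefix actions, $v$ the
number of reused suffix actions and $m = |\pi| - k$.

```python
def classify_strategy(u: int, v: int, m: int) -> str:
    """Name the position (u, v) in the square strategy space with side m."""
    if not (0 <= u <= m and 0 <= v <= m):
        raise ValueError("u and v must lie in the range 0..m")
    if u == 0 and v == 0:
        return "replanning"                  # only pi_fix remains
    if u == m and v == 0:
        return "pure prefix (lazy)"          # pi_pre . pi_fix
    if u == 0 and v == m:
        return "pure suffix (back-on-track)" # pi_fix . pi_suf
    if u == m and v == m:
        return "full overlap"                # whole plan reused twice
    return "mixed"


print(classify_strategy(6, 0, 6))  # pure prefix (lazy)
```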

Besides the extremes, the diagonal $(0, m), (1, m-1), \ldots, (m-1, 1), (m, 0)$ for
$m = |\pi| - k$ is also important from the perspective of the ongoing discussion.
All the strategies lying on this diagonal can re-use all the actions of the original plan
exactly once and in the original order. That is, the original plan is neither overused
nor underused. Formally, we define:

Definition 12. ($m$-normal generalized plan repair) Let $\Sigma = (\Pi, \pi, F, k)$ be a
multiagent plan repairing problem; then an algorithm $R$ is an $m$-normal generalized
plan repair iff $R$ solves the problem $\Sigma$ by a multiagent plan $\pi'$ with
decomposition $\pi_{\mathrm{pre}} \cdot \pi_{\mathrm{fix}} \cdot \pi_{\mathrm{suf}}$ and at
the same time
$(|\pi_{\mathrm{pre}}|, |\pi_{\mathrm{suf}}|) \in \{(0, m), (1, m-1), \ldots, (m-1, 1), (m, 0)\}$,
where $m = |\pi| - k$.
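
The diagonal of Definition 12 can be illustrated with list slicing. The sketch below
assumes, for illustration only, that the remainder of the plan is `plan[k:]`, that the
prefix takes its first `u` actions and that the suffix takes its last `v` actions; the
function names are hypothetical and the generated part $\pi_{\mathrm{fix}}$ is left out.

```python
from typing import List, Tuple


def m_normal_pairs(m: int) -> List[Tuple[int, int]]:
    """All (prefix length, suffix length) pairs on the m-normal diagonal."""
    return [(u, m - u) for u in range(m + 1)]


def is_m_normal(pre_len: int, suf_len: int, m: int) -> bool:
    return pre_len >= 0 and suf_len >= 0 and pre_len + suf_len == m


def reused_actions(plan: List[str], k: int, u: int, v: int) -> List[str]:
    """Original actions reused as prefix and suffix (pi_fix is omitted)."""
    remainder = plan[k:]
    # len(remainder) - v avoids the remainder[-0:] pitfall when v == 0.
    return remainder[:u] + remainder[len(remainder) - v:]


plan = list("abcdef")
k = 2
m = len(plan) - k
for u, v in m_normal_pairs(m):
    # On the diagonal, every remaining action is reused exactly once, in order.
    assert reused_actions(plan, k, u, v) == plan[k:]
```

Off the diagonal the same slicing shows overuse or underuse: with `u = v = m` every
remaining action appears twice, and with `u = v = 0` none is reused.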

5.3.2  Results  and  Discussion 

To validate the fourth and last hypothesis, we used a randomized sampling of the
strategy space and searched for more successful algorithms lying on the $m$-normal
diagonal. The results are presented in Figure 9.

For each encountered repairing problem, the sampling process first measured the
computation time of the replanning strategy. After this baseline measurement,
a tested repairing strategy was run with a bound on the computation time based on
the replanning run-time. If the strategy performed better, a cell in the result map was
incremented by one. In effect, this process rendered the presented normalized results.
During the experimental execution and plan repairing, we used different lengths of the
original plan, i.e., the repair was done for various $|\pi| - k$. Therefore, the resulting
maps depict a continuous space, as the results with higher and lower $|\pi| - k$ values
were merged into the most representative $m$ value corresponding to the initial
multiagent plan generated.
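
The sampling procedure can be sketched as follows. This is not the experimental harness
used for Figure 9; the function `success_map`, its callback parameters `replan_time` and
`repair_time`, and the normalization by the number of tries per cell are assumptions
made for illustration. The merging of different $|\pi| - k$ values into one
representative $m$ is not modeled.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def success_map(problems: List[str], strategies: List[Cell],
                replan_time: Callable[[str], float],
                repair_time: Callable[[str, int, int, float], Optional[float]],
                samples: int, seed: int = 0) -> Dict[Cell, float]:
    """Fraction of sampled runs in which strategy (u, v) beat replanning."""
    rng = random.Random(seed)
    wins: Dict[Cell, int] = defaultdict(int)
    tries: Dict[Cell, int] = defaultdict(int)
    for _ in range(samples):
        problem = rng.choice(problems)
        u, v = rng.choice(strategies)
        bound = replan_time(problem)          # baseline measurement
        t = repair_time(problem, u, v, bound) # None if the bound was exceeded
        tries[(u, v)] += 1
        if t is not None and t < bound:
            wins[(u, v)] += 1                 # increment the cell in the map
    return {cell: wins[cell] / tries[cell] for cell in tries}


# Toy timers (assumed): only strategies on the diagonal u + v == 2 finish in time.
cells = [(u, v) for u in range(3) for v in range(3)]
result = success_map(["p1", "p2"], cells,
                     lambda p: 10.0,
                     lambda p, u, v, bound: 5.0 if u + v == 2 else None,
                     samples=200)
```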

As the maps show, the hypothesis clearly holds for coupled domains with longer
plans (LOGISTICS and ROVERS). In the coupled domain of COOPERATIVE PATHFINDING,
the diagonal is also present, but because of shorter repaired plans, it degenerated
considerably. In the experiment with SATELLITES, the diagonal is not present.

These results support Hypothesis 4 with an auxiliary observation that the effect
decreases as the coupling of the domain decreases.

6  Related  Work 

There are several approaches capable of driving multiagent team activities in an
environment with uncertain dynamics.

Firstly, there is a body of literature dealing with and extending models of
decentralized partially observable Markov Decision Processes (Dec-POMDPs) [1]. A
Dec-POMDP model leads to computation of a policy for the agents in the environment
ensuring that by following it, the team reaches (joint) rewards. The model assumes
only partial observability of the environment and it is capable of capturing various
eventualities which can occur in the environment. The eventualities, however, have to
be probabilistically known a priori, such that a model of action outcomes can be
constructed before planning. Dec-POMDP solvers do not scale well to larger problems,
especially when the model of run-time action failures is a priori unknown. The plan
repairing algorithms proposed in this article do not require an explicit failure model
as an input. The price the plan repairing algorithms pay is their inefficiency, which
grows with how far failures take the system from the presumed evolution based on the
original plan and with the rate of such failures, as shown in [10]. Simplified
single-agent models described by Markov Decision Processes (MDP) were tackled by
scalable techniques of online
Figure 9: The maps present prefix ($u$ on the y-axis) vs. suffix ($v$ on the x-axis)
preserving repairing algorithms by their success rate against replanning in the repair
time, for all domains with three agents and $p = 0.3$. Red color represents repair
strategies faster than replanning. The top-left to bottom-right diagonal represents
algorithms neither overusing nor underusing the original plan.
policy replanning by problem determinization, e.g., in FF-Replan [16]. Determinization
approaches are related to the online replanning scheme. Since we do not assume a
probabilistic model of the failures for the multiagent plan repair algorithms, we cannot
prepare the (partial) policies using the determinization of the model. Our approach
focuses on the efficiency of the replanning process per se, when it is needed because of
a failure. In other words, plan repair assumes a dynamic and optimistic determinization
of the problem based on the original plan.

Secondly, single-agent Contingency [11], Fault Tolerant [7] and Conformant [14]
planning techniques facilitate classical-style planning for domains with
non-probabilistic uncertainty in either action outcomes or the state the system happens
to be in. However, again, in order to plan for actions in such domains, the possible
contingencies and action models in the environment must be known before the planning
phase. In the multiagent plan repair, any outcome of an action is assumed to be possible
in general. In Contingency, Fault Tolerant, or Conformant planning, such an assumption
would lead to actions possibly taking the environment to any state. Thus a complete
graph representing all transitions would render the techniques unusable.

Lastly, the idea of macro actions used for single-agent plan repair [15] builds upon
the positive results of planning with prescribed sequences of primitive actions. The
technique stemmed from the area of integrating planning and machine learning [4] and
was adapted to describe parts of the repaired plan by fixed macro actions. Given
the recency of the technique, it is not surprising that it has not yet been extended to
multiagent planning.


7  Conclusion 

Based on the theoretical and experimental results, we can summarize the heuristic
approaches in the form of simple recommendations decreasing computation and/or
communication overheads during repairing of multiagent plans. These recommendations can
be used for various plan repairing approaches targeting systems with planning agents
reusing the original plan in the form of a combination of prefix and suffix, as we
proposed in the generalized repair. The results were verified for plan repairing
techniques utilizing preservation of the original plan and using a CSP-based multiagent
planner to fill prospective discontinuities in the repairing plan. The recommendations are:

1.  Use  plan  reuse  based  on  generalized  repair  only  with  plans  of  polynomially 
bounded  length. 

2.  Prefer  smaller  numbers  of  involved  agents  in  the  plan  repairing  process. 

3. Prefer suffix preserving repairing techniques when repairing failures with long
dependencies among different agents.

4.  Prefer  m-normal  generalized  plan  repairing  algorithms. 

This work opens several interesting questions left for future work. Most notably,
how would another implementation of the underlying multiagent planner affect the
results, and would it be possible to integrate principles from single-agent search effort
estimation approaches, e.g., as in [6], to provide more precise hints on how to repair
during the execution and repairing process?
Abstract

Multiagent planning for the MA-STRIPS model requires a process of distributed
plan generation for a cooperative team of agents. Heuristic multiagent
state-space search is an obvious candidate, providing a scheme both for the agents'
local search and for an inter-agent protocol for distribution of the search. However,
distributed heuristic estimation, application of multi-heuristic search and efficient
utilization of the decentralized computation power are still open challenges.

The Multiagent Distributed and Local Asynchronous (MADLA) Planner
runs a distributed variant of a state-space forward-chaining multi-heuristic search
with two versions of the well-known FastForward relaxation heuristic, one estimating
the particular agent's local subproblem and another estimating the global
heuristic values. We propose a general asynchronous scheme for such multi-heuristic
multiagent searches and provide proofs of soundness and completeness.
The asynchronism allows efficient utilization of agents' computation power while
waiting for responses from other agents. Also, we provide a novel distribution
scheme for the FastForward heuristic, inspired by the Set-Additive variation of
the FastForward heuristic and with lazy computation of partial relaxed plans.

We experimentally compare the proposed multi-heuristic scheme and the
two used heuristics per se. The results show the proposed solution outperforms
search with either one of the heuristics used separately, positively combining
benefits of both heuristics. In the detailed experimental analysis, we show limits
of the planner and of the used heuristics based on particular properties of the
benchmark domains. In a comprehensive set of multiagent planning domains
and problems, we show that the MADLA Planner outperforms all state-of-the-art
MA-STRIPS multiagent planners.

Keywords:  multiagent  planning,  automated  planning,  multiagent  systems, 
state-space  search,  multi-heuristic  search,  MA-STRIPS 
1. Introduction


The ability of machines to deliberate about sequences of actions (plans) is
one of the classical areas of artificial intelligence. In the case of multiple
machines, or agents, interacting both in the plan synthesis and execution phase,
we talk about multiagent planning (MAP). In contrast to distributed plan
synthesis, in the multiagent case the agents restrict their computational
processes to communicate only information identified as public and do not reveal
all the information they are working with. Such a restriction can be beneficial
not only in domains requiring it by definition, but also to increase efficiency by
focusing only on subproblems relevant for particular agents.

The recently most prevalent model for MAP goes back to the roots of
research in planning itself, to STRIPS. The STRIPS model [1] formalized
planning by a propositional description of the world and the actions, together
forming a transition system representing the planning problem. An extension of
STRIPS towards MAP denoted as MA-STRIPS [2] generalizes the model by
allowing more than one finite set of actions, each set characterizing capabilities
of one agent. Since not all agents must necessarily be capable of influencing the
whole environment, some parts of the information about it can be private to
a subset of agents, alongside the common information treated as public to all
agents.

Because of its computational complexity [2], MAP planners usually rely, similarly
to classical planning, on automatically derived heuristic functions and
algorithms estimating the cost to a goal state. The heuristic state-space search in
MAP was proposed as a form of the well-known A* search algorithm, denoted as
Multiagent Distributed A* (MAD-A*) in [3, 4] and as Multiagent Best-First
Search (MA-BFS) in [5, 4].

One of the most prominent classical planning heuristics still in use is
FastForward (FF) [6], first introduced in the FF planning system. In the MA-STRIPS
literature, classical heuristics were used in the MAD-A* planner, but restricted
only to projected problems, that is, each particular agent was using the
heuristics only on the part of the MAP problem it has access to. Such a projection can
substantially underestimate the cost as it does not consider any other agents'
private actions. A seeming remedy is to distribute the process of heuristic
estimation among all agents such that each agent preserves its privacy by not
communicating anything other than its partial heuristic estimations. A distributed
estimator for the FF heuristic, proven to return the same estimations as
centralized FF, is MA-FF [7] (based on building a distributed form of Relaxed Planning
Graphs [8]). MA-FF is, however, inefficient in practice; more precisely, the
results indicate that there is a trade-off between the quality of the heuristic
estimation and the efficiency of distributed heuristic computation, following similar
results in classical planning with an additional communication overhead. Two
distributed variants of MA-FF focus on practical improvement, namely the lazily
computed lazyFF [7, 5] and rdFF, which uses recursive distributed computation [5].
Although both heuristics improve the efficiency of the underlying search in contrast
to MA-FF, the projected variant of FF still performs better in a non-negligible
number of planning problems, because of its computational ease and no
communication requirements [5]. A natural solution is to combine the projected and
distributed variants of FF.

The principle of combining several heuristic estimators is well known in
classical planning as multi-heuristic search [9, 10], yet it has never been analyzed in
the context of MAP. Specifically, how to combine projected
and distributed heuristic estimators in multiagent distributed state-space search
so that the result is an efficient planning approach is an open problem, as the methods
used in classical planning are clearly not suitable for this class of heuristics. In
this paper, we address this issue both in theory, using a general model, and in
practice, using a particular implementation of a planner. The same general
principle can be easily applied also in areas of distributed AI other than multiagent
planning.

2.  Multiagent  Planning 

As the Multiagent Distributed and Local Asynchronous (MADLA) planner
is based on the MA-STRIPS planning model, we reuse most of the formal
definitions from [2] and amend them mostly with a formalization of state and
action projections required in later definitions of the proposed search scheme
and heuristics.

We assume a set of $n$ cooperative agents with common goals which search
for a multiagent plan solving a planning problem in a coordinated fashion. The
search is decoupled from the prospective execution of the plan, similarly to
classical (off-line) planning. A multiagent planning problem is formally defined
as:

Definition 1. Let a quadruple $\Pi = (P, \{A_i\}_{i=1}^{n}, I, G)$ be a MA-STRIPS
planning problem for $n$ agents $\mathcal{A} = \{\alpha_i\}_{i=1}^{n}$, where:

• $P$ is a finite set of propositions describing facts about the world the agents
act in; a state of the world will be denoted as $s \subseteq P$,

• $A_i$ is a finite set of actions an agent $\alpha_i$ can perform

- each action $a \in A = \bigcup_{i=1}^{n} A_i$ has the standard STRIPS syntax and
semantics $a = (pre(a), add(a), del(a))$, where $pre(a) \subseteq P$, $add(a) \subseteq P$,
$del(a) \subseteq P$ represent preconditions, add effects, and delete effects
respectively,

- a transition function $apply : 2^P \times A \to 2^P$ is defined as $apply(s, a) \mapsto s'$
provided that $pre(a) \subseteq s$, s.t. $s' = s \cup add(a) \setminus del(a)$, where action $a$
is applied in state $s$ with a new resulting state $s'$,

- the sets of actions are pairwise disjoint, that is $\forall i \neq j : A_i \cap A_j = \emptyset$,

• $I \subseteq P$ is the initial state of the world, and

• $G \subseteq P$ is the goal condition defining the goal (final) states of the problem;
a state $s$ is a goal state iff $G \subseteq s$.
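The STRIPS semantics of Definition 1 can be illustrated with a minimal Python sketch. The names `Action`, `apply` and `is_goal` and the toy facts are illustrative assumptions, not part of the MADLA implementation; the sketch also assumes the common STRIPS reading in which delete effects are removed before add effects are inserted.

```python
from typing import FrozenSet, NamedTuple

State = FrozenSet[str]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    dele: FrozenSet[str]  # delete effects ("del" is a Python keyword)


def apply(s: State, a: Action) -> State:
    """Successor of state s under action a; requires pre(a) to be a subset of s."""
    if not a.pre <= s:
        raise ValueError(f"{a.name} is not applicable")
    # Assumption: deletes are applied first, then adds.
    return (s - a.dele) | a.add


def is_goal(s: State, goal: FrozenSet[str]) -> bool:
    """A state is a goal state iff it contains every goal fact."""
    return goal <= s


move = Action("move-a-b", frozenset({"at-a"}), frozenset({"at-b"}), frozenset({"at-a"}))
s0 = frozenset({"at-a", "fuel"})
s1 = apply(s0, move)
```

Representing states as frozen sets keeps them hashable, which is convenient for the OPEN and CLOSED lists used by the search later on.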

The set of all actions of all agents will be denoted as $A = \bigcup_{i=1}^{n} A_i$.
Subsequently, the global planning problem of a multiagent planning problem $\Pi$ will be
defined as $\Pi^G = (P, A, I, G)$.

In order to define a set of facts restricted to one agent, we first define a set
$facts(a) = pre(a) \cup add(a) \cup del(a)$, which denotes the facts required and/or affected
by an action $a$. Then all facts of agent $\alpha_i$ are defined as $P_i = \bigcup_{a \in A_i} facts(a)$.

MA-STRIPS provides a scheme for separating the private (internal)
information of particular agents from the public (common) information which is shared
among all agents. Private facts of an agent $\alpha_i$, called $\alpha_i$-internal, are its facts
which are not facts of any other agent, formally
$P_i^{int} = P_i \setminus \bigcup_{\alpha_j \in \mathcal{A} \setminus \alpha_i} P_j$. Public
facts of an agent $\alpha_i$ are the complement $P_i^{pub} = P_i \setminus P_i^{int}$. In a similar manner,
we define the separation of private and public actions of the agents. Private actions
of agent $\alpha_i$ are those actions that do not affect and are not affected by other
agents via the public facts, formally

$A_i^{int} = \{a \mid a \in A_i, facts(a) \subseteq P_i^{int}\}$.

Conversely, public actions are $A_i^{pub} = A_i \setminus A_i^{int}$.

States and actions restricted to a specific set of facts are projections on that
specific set. If a projection is on the set of facts of an agent $\alpha_i$ and all public
facts, formally $P_i^{proj} = P_i \cup \bigcup_{j=1}^{n} P_j^{pub}$, we will be using the term $\alpha_i$-projection
of a state, formally defined as $s^{\alpha_i} = s \cap P_i^{proj}$. Similarly, an $\alpha_i$-projection of an
action $a \in A_j$ is defined as

$a^{\alpha_i} = (pre(a) \cap P_i^{proj}, add(a) \cap P_i^{proj}, del(a) \cap P_i^{proj})$.
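The partition into internal and public facts and the $\alpha_i$-projection of an action can be sketched as follows. This is only an illustration under stated assumptions: the truck/ship domain, the helper names (`agent_facts`, `internal_facts`, `projection_facts`, `project`) and the dictionary-based representation of agents are hypothetical and do not come from the paper.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    dele: FrozenSet[str]  # delete effects


def facts(a: Action) -> FrozenSet[str]:
    return a.pre | a.add | a.dele


def agent_facts(actions: List[Action]) -> FrozenSet[str]:
    """P_i: all facts required or affected by the agent's actions."""
    return frozenset().union(*(facts(a) for a in actions))


def internal_facts(agent: str, agents: Dict[str, List[Action]]) -> FrozenSet[str]:
    """P_i^int: facts of the agent that are facts of no other agent."""
    others = frozenset().union(
        *(agent_facts(acts) for name, acts in agents.items() if name != agent))
    return agent_facts(agents[agent]) - others


def projection_facts(agent: str, agents: Dict[str, List[Action]]) -> FrozenSet[str]:
    """P_i^proj: the agent's own facts plus the public facts of all agents."""
    public = frozenset().union(
        *(agent_facts(acts) - internal_facts(name, agents)
          for name, acts in agents.items()))
    return agent_facts(agents[agent]) | public


def project(a: Action, proj: FrozenSet[str]) -> Action:
    """Projection of an action: keep only facts inside proj."""
    return Action(a.name, a.pre & proj, a.add & proj, a.dele & proj)


def fs(*xs: str) -> FrozenSet[str]:
    return frozenset(xs)


drive = Action("drive", fs("truck-at-depot"), fs("truck-at-port"), fs("truck-at-depot"))
unload = Action("unload", fs("truck-at-port", "pkg-in-truck"), fs("pkg-at-port"), fs("pkg-in-truck"))
take = Action("take", fs("pkg-at-port"), fs("pkg-on-ship"), fs("pkg-at-port"))
agents = {"truck": [drive, unload], "ship": [take]}

unload_seen_by_ship = project(unload, projection_facts("ship", agents))
```

In this toy domain the ship sees `unload` as an action without preconditions that simply produces `pkg-at-port`, which is why a projected heuristic can underestimate the true cost.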

A solution of a multiagent planning problem is a multiagent plan.

Definition 2. Let a sequence of actions $\pi = (a_1, \dots, a_k)$ be a multiagent plan
solving a multiagent planning problem $\Pi = (P, \{A_i\}_{i=1}^{n}, s_0, G)$ iff the sequence
is sound, that is $pre(a_i) \subseteq s_{i-1}$, where $s_i = apply(s_{i-1}, a_i)$ for $i \geq 1$, and
$G \subseteq s_k$.

In the rest of the article, we will use the term plan for multiagent plans, and
in ambiguous cases we will distinguish between multiagent plans and single-agent plans.
Additionally, we will use the term dead-end state $s$ to indicate that no plan
exists from $s$ to any goal state.
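The soundness condition of Definition 2 amounts to a simple forward simulation of the plan. The sketch below is illustrative only; the function name `solves` and the toy domain are assumptions, and delete effects are assumed to be applied before add effects.

```python
from typing import FrozenSet, NamedTuple, Sequence


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    dele: FrozenSet[str]  # delete effects


def solves(plan: Sequence[Action], init: FrozenSet[str], goal: FrozenSet[str]) -> bool:
    """True iff every action is applicable in turn and the final state contains the goal."""
    s = init
    for a in plan:
        if not a.pre <= s:           # pre(a_i) must hold in s_{i-1}
            return False
        s = (s - a.dele) | a.add     # s_i = apply(s_{i-1}, a_i)
    return goal <= s                 # G must hold in s_k


def fs(*xs: str) -> FrozenSet[str]:
    return frozenset(xs)


drive = Action("drive", fs("truck-at-depot"), fs("truck-at-port"), fs("truck-at-depot"))
unload = Action("unload", fs("truck-at-port", "pkg-in-truck"), fs("pkg-at-port"), fs("pkg-in-truck"))
take = Action("take", fs("pkg-at-port"), fs("pkg-on-ship"), fs("pkg-at-port"))

init = fs("truck-at-depot", "pkg-in-truck")
goal = fs("pkg-on-ship")
ok = solves([drive, unload, take], init, goal)
bad = solves([take, drive, unload], init, goal)
```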

A  computational  process  generating  solutions  for  planning  problems  is  a 
planner. 

Definition 3. A (distributed) algorithm is a multiagent planner iff it accepts
a multiagent planning problem $\Pi = (P, \{A_i\}_{i=1}^{n}, I, G)$ as an input and produces
a multiagent plan $\pi$ solving $\Pi$ as an output. Such a planner is

sound iff any multiagent plan $\pi$ produced by the multiagent planner is a sound
plan for $\Pi$,

complete iff the multiagent planner produces a plan for any multiagent problem
$\Pi$ s.t. $\exists \pi$ which is a solution of $\Pi$.


As in the definition of a multiagent plan, we will use the term planner
for multiagent planners in unambiguous cases. Additionally, we assume that
$c(a) \mapsto 1$, i.e., the optimality criterion is the number of actions in the plan,
or length $|\pi| = k$ of a plan $\pi = (a_1, \dots, a_k)$, but all the presented techniques
are easily modified to the general case. A complement to the definition of a
cost function is a definition of a heuristic function $h^{\Pi}(s)$ for a problem $\Pi$ as
$h^{\Pi} : 2^P \to \mathbb{R}^+$ estimating the cost of a plan from a state $s$ to a goal state. An
evaluation $h^{\Pi}(s) \mapsto \infty$ will denote states $s$ which the heuristic $h^{\Pi}$ estimates
as dead ends. A heuristic estimator of a heuristic function $h$ is an algorithm
evaluating the heuristic $h$.

The  definition  of  a  multiagent  planner  wraps  up  the  formalization  for  the 
context  of  multiagent  planning  in  the  rest  of  the  article. 

3.  The  MADLA  Planner 

The MADLA planner proposed in this article is a sound and complete MA-STRIPS
multiagent planner (Definition 3). The planner distributively searches
through a state-space induced by a multiagent planning problem $\Pi$ (Definition 1)
and if a solution exists, it returns a multiagent plan $\pi$ solving $\Pi$ (Definition 2).
The search is driven by a combination of two heuristics $h_L$ and $h_D$: the first one,
$h_L$, works on a single agent's view (projection) of the problem $\Pi$ and the other, $h_D$,
works distributively over the global problem $\Pi^G$.

Only specific pairs of heuristics comply with the proposed search$^1$.

Definition 4. A heuristic estimator of a heuristic function $h$ for a planning
problem $\Pi^G$ run by agent $\alpha_i$ is non-blocking iff the computation of $h$ does
not block the computation process of agent $\alpha_i$ for the whole duration of the
computation of $h$.

For example, a global heuristic estimator can require computation of parts
of the heuristic estimate by other agents. If such a heuristic algorithm is
non-blocking, it does not wait for responses from other agents and allows the agent
to run asynchronously while waiting for the responses, that is, the agent can
use the time to perform some other computations.

The definition of dominance is the same as in classical planning:


1 The answer to the question of why is given in Section 3.2, which describes the MADLA
search in detail.

Definition 5. A heuristic function $h_1$ dominates a heuristic function $h_2$ for a
planning problem $\Pi$ iff for all states $s \in 2^P$ it holds that $h_1(s) \geq h_2(s)$.

Finally,  we  need  a  relation  between  two  heuristics  describing  their  relative 
computational  hardness: 

Definition 6. A heuristic estimator of a heuristic function $h_1$ is computationally
harder than a heuristic estimator of a heuristic function $h_2$ for a planning
problem $\Pi$ iff for all states $s \in 2^P$ it holds that the computation of $h_1(s)$ takes the
same or longer time than the computation of $h_2(s)$.

With the help of the three definitions, we can define the properties of a pair of
heuristics which are required for the search in the MADLA planner:

Definition 7. For a multiagent planning problem $\Pi$, let $h_L$ be a heuristic function
for $\Pi$ and let $h_D$ be a heuristic function for the global problem $\Pi^G$. The
heuristics $h_L$ and $h_D$ are a MADLA heuristic pair iff $h_D$ uses a non-blocking
estimator by Definition 4, $h_D$ dominates $h_L$ by Definition 5, and the $h_D$ estimator is
computationally harder than the $h_L$ estimator by Definition 6.

The intuition behind the properties is that the search combines a computationally
fast heuristic $h_L$, which does not necessarily work with all the information
in the problem and therefore likely gives a worse estimate, and a computationally
slower heuristic $h_D$, which works with global information such that the agent using
the estimator can get additional information from other agents (e.g., in the form
of an increased cost of some actions). Therefore the global heuristic estimate is
higher than the local estimate. Such a combination can work smoothly only if $h_D$
does not block the agent using it; the non-blocking property therefore allows the
agents to continue the search using $h_L$ whilst waiting for responses from other
agents computing parts of $h_D$.

In the next sections, we describe the heuristics and the MADLA search in detail.

3.1.  Heuristics 

The key principle behind the MADLA planner is a favorable combination
of both local and distributed heuristics. A heuristic pair complying with the
search used in the MADLA planner combines a light single-agent heuristic and a heavy
distributed multiagent heuristic, which in this article are both forms of the
FastForward (FF) heuristic [6]. Here, we describe both variants and show that
they satisfy the properties required by Definition 7.

3.1.1.  Local  Heuristic 

In  this  section,  we  will  introduce  the  local  heuristic  used  in  our  planning 
system.  By  “local”  we  mean  a  heuristic,  which  computes  the  estimates  using 
only  the  respective  agent’s  view  of  the  problem  as  proposed  in  [3]  (also  referred 
to  as  projected  heuristic,  as  the  heuristic  estimate  is  for  the  projected  problem). 

In MA-STRIPS, for agent $\alpha_i$, this means using only its $\alpha_i$-internal and public
facts, its actions and $\alpha_i$-projections of actions of other agents
$\bigcup_{\alpha_j \in \mathcal{A}} \{a^{\alpha_i} \mid a \in A_j\}$. The resulting local planning problem is the
multiagent planning problem $\Pi$ projected to agent $\alpha_i$.

Definition 8. Let $h^{\Pi^{\alpha_i}} : 2^{P_i} \to \mathbb{R}^+$ be a local heuristic function for a projection
$\Pi^{\alpha_i} = (P_i, \bigcup_{\alpha_j \in \mathcal{A}} \{a^{\alpha_i} \mid a \in A_j\}, I \cap P_i^{proj}, G \cap P_i^{proj})$
of a multiagent planning problem $\Pi$ for agent $\alpha_i$.

Local FF Heuristic. One of the most successful and most studied heuristics for
satisficing planning is the FF heuristic [6]. We use FF in the MADLA planner
as the local heuristic. The FF heuristic belongs to the delete relaxation heuristic
family.

The idea behind delete relaxation heuristics is to simplify the problem by
ignoring negative effects of actions. In the STRIPS formalism, this means
that an action $a = (pre(a), add(a), del(a))$ is transformed to a relaxed form
$a^+ = (pre(a), add(a), \emptyset)$. The set of relaxed actions, denoted $A^+ = \{a^+ \mid a \in A\}$, is
used in the definition of a relaxed planning problem $\Pi^+ = (P, A^+, I, G)$ corresponding
to the original planning problem $\Pi = (P, A, I, G)$. Finding an optimal plan $\pi^+$
for $\Pi^+$ is NP-Complete [11], therefore it is still impractical as a heuristic estimate.
In order to lower the complexity even more, approximations of $\pi^+$ are used in
classical planning. The most commonly used approximation is a sub-optimal
relaxed plan (RP). Finding a sub-optimal RP can have as low as polynomial
complexity and therefore can be fast enough in practice. The length of the RP is
used as the heuristic estimate.

The main idea behind the FF heuristic is to analyze which facts
are successively reachable by applying relaxed actions (reachability analysis) and
from this analysis to determine the relaxed plan in a backward fashion. The
principle uses the notion of a supporter action $a$ of a fact $p$, which is an action $a$ s.t.
$p \in add(a)$. Let $\Pi^+ = (P, A^+, I, G)$ be a relaxed planning problem, then the
principle is the following:

1. Initialize a set of unsupported facts $U$ to contain all goal facts: $U \leftarrow G$.

2. Move an unsupported fact $p$ from $U$ to a set of supported facts $S$ and
determine its supporter $a$.

3. Mark all preconditions of $a$ as unsupported if not supported already:
$U \leftarrow U \cup (pre(a) \setminus S)$.

4. Repeat steps 2 and 3 until all facts in $U$ are either supported or in the initial state:
until $U \setminus S = \emptyset \lor U \subseteq I$.

There are many ways to implement this high-level scheme (which differ
mainly in how the supporters are chosen) and many ways
to perform the reachability analysis. In our work, we have used one of the
most prevalent, based on an Exploration Queue algorithm as implemented for
example in the Fast-Downward planning system [9]. The main principle of the
Exploration Queue is that reachable facts are iteratively added to the queue
and, as the head of the queue is removed, if some action becomes applicable (the
removed fact was its last unsatisfied precondition not yet removed), the
action is applied and its effects are added to the queue.
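The two phases described above, queue-based reachability followed by backward extraction of a relaxed plan, can be sketched in Python as follows. This is a simplified illustration, not the Fast Downward or MADLA code: it assumes unit action costs, picks the first achiever of each fact as its supporter, and all names (`relaxed_plan`, `supporter`, `unsat`) are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Set


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]  # relaxed action: delete effects are ignored


def relaxed_plan(actions: List[Action], init: FrozenSet[str],
                 goal: FrozenSet[str]) -> Optional[Set[str]]:
    """Return the action names of a (sub-optimal) relaxed plan, or None if unreachable."""
    unsat = {a.name: len(a.pre) for a in actions}      # unsatisfied preconditions left
    waiting: Dict[str, List[Action]] = {}              # fact -> actions requiring it
    for a in actions:
        for p in a.pre:
            waiting.setdefault(p, []).append(a)
    supporter: Dict[str, Optional[Action]] = {p: None for p in init}
    queue = deque(init)

    def achieve(a: Action) -> None:
        for q in a.add:
            if q not in supporter:                     # first achiever becomes supporter
                supporter[q] = a
                queue.append(q)

    for a in actions:
        if not a.pre:                                  # applicable without any fact
            achieve(a)
    while queue:                                       # forward reachability
        p = queue.popleft()
        for a in waiting.get(p, []):
            unsat[a.name] -= 1
            if unsat[a.name] == 0:                     # p was the last missing precondition
                achieve(a)
    if any(g not in supporter for g in goal):
        return None
    plan: Set[str] = set()
    supported: Set[str] = set()
    unsupported = list(goal)                           # U <- G
    while unsupported:                                 # backward extraction
        p = unsupported.pop()
        if p in supported:
            continue
        supported.add(p)
        a = supporter[p]
        if a is not None:                              # None marks initial-state facts
            plan.add(a.name)
            unsupported.extend(a.pre - supported)      # U <- U + (pre(a) minus S)
    return plan


def fs(*xs: str) -> FrozenSet[str]:
    return frozenset(xs)


acts = [Action("drive", fs("truck-at-depot"), fs("truck-at-port")),
        Action("unload", fs("truck-at-port", "pkg-in-truck"), fs("pkg-at-port")),
        Action("take", fs("pkg-at-port"), fs("pkg-on-ship"))]
rp = relaxed_plan(acts, fs("truck-at-depot", "pkg-in-truck"), fs("pkg-on-ship"))
```

With unit costs the heuristic estimate is simply `len(rp)`; returning a set of actions rather than a number is what later makes the set-additive merging possible.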

The local FF heuristic is used in MADLA on the projection of the planning
problem $\Pi$ in the form of $h^{\Pi^{\alpha_i}}$ by Definition 8. The relaxation is therefore done
over actions $A_i$ of an agent $\alpha_i$ such that the problem is a local relaxation, formally
the relaxed projected problem $\Pi^{\alpha_i+}$ (1).


3.1.2.  Distributed  Heuristic 

In  addition  to  a  local  heuristic,  the  MADLA  planner  uses  a  distributed 
multiagent  heuristic  (also  referred  to  as  global  heuristic,  as  the  heuristic  estimate 
is  for  the  global  problem).  In  contrast  to  the  computation  of  a  local  heuristic 
by  a  single  agent,  a  coordinated  computation  by  multiple  agents  is  necessary 
for  evaluation  of  distributed  heuristic  estimates.  Computation  of  such  heuristic 
incurs  a  communication  overhead,  which  is  in  many  cases  outweighed  by  better 
estimations  w.r.t  the  local  form  of  the  heuristic. 

In MA-STRIPS, a distributed heuristic is defined over a multiagent planning
problem:

Definition 9. Let $h^{\Pi} : 2^P \to \mathbb{R}^+$ be a distributed heuristic function for a
multiagent planning problem $\Pi$.

Even if the number of communicated bytes is not a concern, the process
of communication is many times slower than local computation and thus
intensive communication can significantly slow down the planner.
This problem was tackled in the literature by computing the distributed parts
upfront and caching the results [12], or by lazy (on-request) evaluation [7, 5].
In both our recent heuristics and the new heuristic proposed in this article we use
the latter approach.

Distributed FF Heuristic. The novel heuristic in the next section is based on
our distributed lazily evaluated form of the FF heuristic, which was presented
in [7] using explicit Relaxed Planning Graphs for the reachability analysis and
updated to use the more efficient Exploration Queue algorithm in [5]. This
approach, termed Lazy FF (lazyFF), distributes the process of finding and
extracting a Relaxed Plan from a state $s$ based on the following principle:

1. The agent $\alpha_i$ initiating the estimation locally computes a projected
relaxed plan $\pi^{\alpha_i+}$ which is a solution of an $\alpha_i$-projection of the relaxed
problem $\Pi^{\alpha_i+}$ (see Equation 1) and initializes a resulting relaxed plan
cost by the cost of the computed RP excluding all projected actions:
$c^{\alpha_i} \leftarrow \sum_{a \in \pi^{\alpha_i+} \cap A_i^+} c(a)$.

2. For each projected action $a^{\alpha_i} \in \pi^{\alpha_i+} \setminus A_i^+$, the initiator agent $\alpha_i$ sends a
request to the action's owner agent $\alpha_j$. Upon receiving, $\alpha_j$ computes a partial
RP $\pi^{\alpha_j+}$ as a solution of a problem
$(P_j, A_j^+ \cup \bigcup_{\alpha_k \in \mathcal{A}} \{a^{\alpha_j} \mid a \in A_k^+\}, s \cap P_j, pre(a))$
and sends the cost of $\pi^{\alpha_j+} \cap A_j^+$ to the initiator agent as a reply:
$c^{\alpha_j} \leftarrow \sum_{a \in \pi^{\alpha_j+} \cap A_j^+} c(a)$.

3. The agent $\alpha_j$ may need to ask other agent(s) $\alpha_k$ in the same manner,
resulting in a distributed recursion summing up the costs: $c^{\alpha_j} \leftarrow c^{\alpha_j} + c^{\alpha_k}$.

4. The initiator agent then adds the received cost to the resulting RP cost:
$c^{\alpha_i} \leftarrow c^{\alpha_i} + c^{\alpha_j}$.

The  most  severe  drawback  of  Lazy  FF  is  that  although  only  local  actions  of 
agents  are  counted,  significant  over-counting  of  actions  often  appears  due  to  the 
distributed  recursion  and  possibly  repeated  counting  of  actions  already  counted 
by  the  same  agents. 

In [5], the lazy approach was also applied to the principle of the Exploration
Queue itself. The basic process of building the distributed exploration queue
is similar to the centralized version, but whenever a projection of some other
agent's action should be applied (and its effect added to the queue), a request
is sent to the owner of the action to obtain its true cost first. This follows the
lazy principle as described in the 2nd step of Lazy FF. The effect of the action is
added to the distributed queue after the reply is received. When an agent is
computing the reply, the agent may need to send requests as well, thus ending
up with a distributed recursion as in the 3rd step of Lazy FF. In order to handle
the recursion effectively, it is flattened so that all requests are sent by the initiator
agent $\alpha_i$ and the replies are supplied with the parameters allowing the initiator
agent $\alpha_i$ to request the agents $(\alpha_j, \alpha_k, \dots)$ of the next calls. The Relaxed Plan
is then extracted only locally. The resulting heuristic, Recursive Distributed FF
(or rdFF), exhibited better performance than Lazy FF using (local) Exploration
Queues for the partial RP computation, most probably due to the over-counting
of actions in Lazy FF.

3.1.3. Set-Additive Lazy FF

In this article, we introduce a new technique to handle the over-counting
in Lazy FF. We take inspiration from the Set-Additive variation of the FF
heuristic [13], where instead of the cost of reaching a fact $p$ in a planning problem
$\Pi = (P, A, I, G)$, each fact $p$ is associated with a relaxed plan $\pi_p^+$ solving a
relaxed planning problem where $p$ is the goal $G = \{p\}$. The overall relaxed plan
$\pi^+$ is then constructed by computing a set union of the respective fact relaxed
plans

$\pi^+ = \bigcup_{p \in P} \pi_p^+$ (2)

which is possible as the order of the actions in a relaxed plan can be arbitrary
and using any action more than once is redundant. The new heuristic is termed
Set-Additive Lazy FF or SA Lazy FF.
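The effect of replacing summed costs by a set union can be seen in a tiny example. The per-fact relaxed plans below are made-up data for illustration and unit action costs are assumed; the function names are not taken from the paper.

```python
from typing import Dict, FrozenSet


def additive_cost(fact_plans: Dict[str, FrozenSet[str]]) -> int:
    """Sum of the per-fact plan costs: shared actions are counted repeatedly."""
    return sum(len(plan) for plan in fact_plans.values())


def set_additive_plan(fact_plans: Dict[str, FrozenSet[str]]) -> FrozenSet[str]:
    """Union of the per-fact relaxed plans: each action appears once."""
    return frozenset().union(*fact_plans.values())


# Hypothetical relaxed plans for two facts; both need the same "drive" action.
fact_plans = {
    "pkg1-at-port": frozenset({"drive", "unload1"}),
    "pkg2-at-port": frozenset({"drive", "unload2"}),
}
summed = additive_cost(fact_plans)
merged = set_additive_plan(fact_plans)
```

Here the summed estimate counts `drive` twice, while the merged relaxed plan contains it once.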

The estimation of SA Lazy FF proceeds similarly as in Lazy FF using Exploration
Queues. The difference is that the agents do not send the intermediate
costs $c^{\alpha_j}, c^{\alpha_k}, \dots$, but the relaxed plans $\pi^{\alpha_j+}, \pi^{\alpha_k+}, \dots$, which are then merged
by the initiating agent $\alpha_i$. The resulting heuristic estimate $c^{\alpha_i}$ is the cost of the
merged relaxed plan. The SA Lazy FF estimator in the recursive form follows:

1. The agent $\alpha_i$ initiating the estimation locally computes a projected relaxed
plan $\pi^{\alpha_i+}$ which is a solution of a relaxed projected problem $\Pi^{\alpha_i+}$ (see
Equation 1).

2. For each projected action $a^{\alpha_i+} \in \pi^{\alpha_i+} \setminus A_i^+$, the initiator agent $\alpha_i$ sends
a request to the action's owner agent $\alpha_j$. Upon receiving, $\alpha_j$ computes a
partial RP $\pi^{\alpha_j+}$ as a solution of a projected problem

$\Pi^{\alpha_j+} = (P_j, \bigcup_{\alpha_k \in \mathcal{A}} \{b^{\alpha_j+} \mid b \in A_k^+\}, s \cap P_j, pre(a^+))$

and sends $\pi^{\alpha_j+}$ to the initiator agent as a reply.

3. The agent $\alpha_j$ may need to ask other agent(s) $\alpha_k$ in the same manner,
resulting in a distributed recursion merging the partial relaxed plans:
$\pi^{\alpha_j+} \leftarrow \pi^{\alpha_j+} \cup \pi^{\alpha_k+}$.

4. The initiator agent merges the received relaxed plan $\pi^{\alpha_j+}$ with the initial
RP, $\pi^{\alpha_i+} \leftarrow \pi^{\alpha_i+} \cup \pi^{\alpha_j+}$, and returns its cost: $\sum_{a \in \pi^{\alpha_i+}} c(a)$.
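The difference between the two request-reply schemes can be mimicked in a single process. The sketch below is a strong simplification under stated assumptions: the table `local_rp` is hypothetical and stands in for each agent's local relaxed plan computation, a reply is assumed to contain the requested action together with the owner's actions supporting it, unit costs are used, and the request graph is assumed to be acyclic.

```python
from typing import Dict, List, Set, Tuple

# (agent, request) -> actions in the agent's local relaxed plan as (owner, action).
# Entries owned by another agent are projected actions that trigger a request.
LocalRP = Dict[Tuple[str, str], List[Tuple[str, str]]]


def lazy_ff(agent: str, request: str, local_rp: LocalRP) -> int:
    """Lazy FF: replies carry costs, which are summed along the recursion."""
    cost = 0
    for owner, action in local_rp[(agent, request)]:
        if owner == agent:
            cost += 1                                  # own action, unit cost
        else:
            cost += lazy_ff(owner, action, local_rp)   # cost replied by the owner
    return cost


def sa_lazy_ff(agent: str, request: str, local_rp: LocalRP) -> Set[str]:
    """SA Lazy FF: replies carry relaxed plans, which are merged by set union."""
    plan: Set[str] = set()
    for owner, action in local_rp[(agent, request)]:
        if owner == agent:
            plan.add(action)
        else:
            plan |= sa_lazy_ff(owner, action, local_rp)  # plan replied by the owner
    return plan


local_rp: LocalRP = {
    ("ship", "goal"): [("truck", "unload1"), ("truck", "unload2"),
                       ("ship", "take1"), ("ship", "take2")],
    ("truck", "unload1"): [("truck", "drive"), ("truck", "unload1")],
    ("truck", "unload2"): [("truck", "drive"), ("truck", "unload2")],
}
lazy_estimate = lazy_ff("ship", "goal", local_rp)
sa_plan = sa_lazy_ff("ship", "goal", local_rp)
```

In this made-up instance the truck's `drive` action is needed for both requests, so the cost-based variant counts it twice while the merged relaxed plan contains it once.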

Similarly to the Lazy FF estimator, we have to explicitly consider that in the
2nd step, no solution for $\Pi^{\alpha_j+}, \Pi^{\alpha_k+}, \dots$ may exist. Such a situation indicates
a dead-end on preconditions $pre(a)$. As it is not cheaply possible to find out
if another relaxed plan $\pi^{\alpha_i+}$ could be used, we ignore the information and
return the heuristic as if the action was reachable (and the replied relaxed plan
$\pi^{\alpha_j+}, \pi^{\alpha_k+}, \dots$ empty). This causes the heuristic to report non-$\infty$ estimates for
dead-end states (it will not be safe). Since the FF heuristic is not safe by itself and
the practical efficiency gains are substantial, we conclude it is not a (practical)
problem.

Following the same arguments as in the case of Lazy FF, for the implementation
of SA Lazy FF, we have flattened the recursion$^2$ and used the Exploration
Queues, ending with an algorithm implementing Equation 2 using distributed
requests for $\pi^+$ plans.

Soundness. In the SA Lazy FF heuristic we trade the estimation equality
with centralized FF for efficiency of the distributed computation. Technically,
each agent may choose a different supporter action for a single fact, thus in the
resulting relaxed plan a single fact may have multiple supporting actions. This
unwanted effect, which causes over-counting of actions, can be partially reduced by the
following modification. When an agent receives a request, it computes a relaxed
plan only to the private preconditions of the action; the public preconditions
were already taken care of by the initiator agent. A proof of soundness of the
resulting relaxed plan $\pi^{\alpha_i+}$ follows:

Proposition 10. Let $\Pi$ be a multiagent planning problem, $\Pi^G$ the respective
global problem and $\Pi^{G+} = (P, A^+, I, G)$ the respective relaxed global problem. A
relaxed plan $\pi^{\alpha_i+}$ computed by agent $\alpha_i$ using the SA Lazy FF estimator is a
sound solution of $\Pi^{G+}$.


2 For applications requiring “real” private knowledge preservation as proposed in [14], SA
Lazy FF can use hashes as a way of obfuscation (and also communication load reduction) for
the communicated private actions. It is questionable whether some useful information can
be extracted from the obfuscated relaxed plans, but if so, this approach may not be suitable
for applications requiring such a degree of privacy protection.

Proof. In the SA Lazy FF estimator, we start with a projected relaxed plan
$\pi^{\alpha_i+}$, which is not a valid plan for the global relaxed problem $\Pi^{G+}$, because
it may contain projected actions which are not part of the global problem.
We can transform $\pi^{\alpha_i+}$ by replacing all projected relaxed actions $a^{\alpha_i+}$ by the
original relaxed actions $a^+$. We will denote the resulting relaxed plan as $\pi^{\alpha_i \to G}$.
Notice that $\pi^{\alpha_i \to G}$ is still not a valid plan for $\Pi^{G+}$, because it may contain actions
which have some preconditions not met. Those are private preconditions of the
originally projected actions. There are no actions to support them.

The SA Lazy FF estimator then continues with sending requests for each
$a^{\alpha_i+} \in \pi^{\alpha_i+} \setminus A_i^+$. Neither the order of actions nor the exact process of obtaining
the reply is important here. The reply obtained from some agent $\alpha_j$ for action $a^{\alpha_i+}$
s.t. $a^+ \in A_j^+$ is a relaxed plan $\pi^{\alpha_j+}$, such that its application to the global
initial state $I$ results in satisfying all private preconditions of $a^+$ (those are
$pre(a^+) \cap facts(\alpha_j)$). Of course, $\pi^{\alpha_j+}$ may again contain projected actions of
agents other than $\alpha_j$ (including $\alpha_i$).

Since by application of $\pi^{\alpha_i+} \setminus a^{\alpha_i+}$ to the initial state $I$, all public preconditions
of $a^+$ are satisfied, the union of the relaxed plans $\pi^{\alpha_i, \alpha_j} = \pi^{\alpha_i} \cup \pi^{\alpha_j}$ satisfies
all preconditions of $a^+$. The union may introduce new projected actions, but
as there is a finite number of actions in the problem, all projected actions are
eventually resolved by repeated application of the request-reply protocol.
Potentially, there may be cycles in the partial relaxed plans (e.g. action $a \in A_i$
achieves precondition $p_b$ of action $b \in A_j$ and $b$ achieves precondition $p_a$ of $a$),
but as the preconditions $p_a, p_b$ are shared among the agents $i, j$ they must be
public and therefore either $p_a$ or $p_b$ must already be resolved in the RP.

When the protocol finishes, all actions in the resulting merged relaxed plan
$\pi^{\mathcal{A}}$ have all their preconditions supported. Since $\pi^{\alpha_i}$ already achieved the goal
projected to agent $\alpha_i$, that is $G^{\alpha_i}$, and since a global goal is assumed, that is
$G^{\alpha_i} = G$, also $\pi^{\mathcal{A}}$ achieves the global goal $G$ and is a valid plan for the relaxed
problem $\Pi^{G+}$. □

MADLA Search compliance. For the MADLA Search, by Definition 7 we need
two heuristics $h_L$ and $h_D$. As the local $h_L$, we use projected FF (see Section 3.1.1)
and as the distributed (global) $h_D$ we use SA Lazy FF defined in the previous
paragraphs. Now, we can show the following:

Proposition 11. The local FF heuristic $h_L$ and the global SA Lazy FF heuristic
$h_D$ comply with the definition of a MADLA heuristic pair (Definition 7).

Proof. First, $h_D$ dominates $h_L$ (Definition 5): The Exploration Queue algorithm
is the same in both cases, and the initial RP extraction procedure is the same as well.
In addition, SA Lazy FF sends requests and possibly augments the RP with
received actions of other agents. Therefore the length (and thus the heuristic
estimate) of the RP constructed by SA Lazy FF is always equal to or longer than
the one constructed by projected FF.

Second, the $h_D$ estimator is computationally harder than the $h_L$ estimator
(Definition 6): The initial step of both heuristic evaluations is the same, namely building a
local reachability analysis and finding the local relaxed plan. The SA Lazy FF
heuristic then continues with additional computation (also involving
communication) to obtain the global RP estimate. Therefore the computation of SA
Lazy FF always takes the same or longer time to finish, regardless of the speed of
the communication.

Last, the $h_D$ estimator is non-blocking (Definition 4): When the reachability analysis is
done and the local RP extracted, SA Lazy FF sends requests to all agents whose
projected actions were used in the RP. After sending the requests, the agent
waits for the replies; while waiting, some other asynchronous computation can
be performed. The agent must also compute replies to requests from other
agents, but this does not break the property. □

To sum up the situation, we have two heuristic estimators: one less informative
but fast and one more informative but slower. This is quite a common situation
in classical planning. In our case, the slower heuristic $h_D$ is asynchronous
(that is, non-blocking), meaning the agent can perform other computation while
it is being evaluated. The computation the agents will perform meanwhile is
a local search driven by the estimates of the local heuristic $h_L$. As soon as the agent
receives the estimate from $h_D$, it can progress with the global search.
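The non-blocking behaviour summarized above can be sketched with Python's asyncio: while the distributed estimate is pending, the agent keeps evaluating states with the cheap local heuristic. This is a toy illustration only; both heuristic functions are hypothetical stand-ins, the sleep replaces real message exchange, and none of the names come from the MADLA implementation.

```python
import asyncio
from typing import FrozenSet, List, Tuple

State = FrozenSet[str]


def local_estimate(state: State) -> int:
    """Stand-in for the fast local heuristic h_L (hypothetical)."""
    return len(state)


async def distributed_estimate(state: State) -> int:
    """Stand-in for h_D: the sleep models waiting for replies from other agents."""
    await asyncio.sleep(0.05)
    return len(state) + 2  # hypothetical value, never below the local estimate


async def evaluate(state: State, local_states: List[State]) -> Tuple[int, List[Tuple[State, int]]]:
    """Request h_D for `state` and do local h_L work until the reply arrives."""
    pending = asyncio.create_task(distributed_estimate(state))
    local_results: List[Tuple[State, int]] = []
    while not pending.done():
        if local_states:
            s = local_states.pop()
            local_results.append((s, local_estimate(s)))  # local search step
            await asyncio.sleep(0)        # yield so replies can be processed
        else:
            await asyncio.sleep(0.001)    # nothing to do locally, just wait
    return pending.result(), local_results


h_d, local_results = asyncio.run(
    evaluate(frozenset({"a"}), [frozenset({"b"}), frozenset({"b", "c"})]))
```

The agent here never blocks on the reply itself; it only checks whether the reply has arrived between local evaluation steps.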

3.2.  Search 

In order to find a plan, the MADLA planner searches through a state-space
induced by the input planning problem, driven by the proposed heuristic pair of
local FF and distributed SA Lazy FF. From the literature [5, 15], we know that
the performance of local and distributed heuristics differs over various multiagent
planning problems. The search we are proposing combines the local and
distributed heuristics in a manner inspired by classical multi-heuristic search [9],
but modified to be suitable for similar (or, as in our case, the same) heuristics in
local and distributed forms.

Since MADLA is a multiagent planner, the search is a distributed computation
spread over the agents. We build on the idea of distributed state-space
search [3, 5] and extend it to a distributed multi-heuristic search.

Each  building  block  is  defined  formally  and  instantiated  for  use  with  FF  and 
SA  Lazy  FF  heuristics.  Similarly,  the  merged  multiagent  multi-heuristic  search 
is  defined  generally  and  grounded  into  the  particular  MADLA  Search  with  FF 
and  SA  Lazy  FF.  The  MADLA  search  algorithm  is  then  described  in  detail. 

3.2.1. Classical Multi-Heuristic Search

Multi-heuristic search was pioneered by the Fast Downward planning system
[9] as a way to combine different heuristic estimators without the need to
combine the heuristic values, and was also one of the main mechanisms behind
the success of the LAMA planner [16].

Definition 12. Let $\Pi = (P, A, I, G)$ be a planning problem and $h_1, \dots, h_m$
heuristic functions for $\Pi$. A multi-heuristic search is defined over states $s \in 2^P$
and uses OPEN lists $O_1, \dots, O_m$ of states, a CLOSED list $C$ of states, a
selection function

$select(O_1, \dots, O_m) \mapsto O_k$

where $1 \leq k \leq m$ and a state expansion function

$expand(s') \mapsto \{s \mid s = apply(s', a), a \in A \text{ s.t. } pre(a) \subseteq s'\}$.

A multi-heuristic search systematically extracts a state $s = \arg\min_{s \in O_k} h_k(s)$
from $O_k$ selected by the select function, adds $s$ into $C$ and adds states $expand(s) \setminus C$
into $O_i$, for all $1 \leq i \leq m$. The search begins with $O_i = \{I\}$ for all $1 \leq i \leq m$
and terminates if $G \subseteq s$. The result is a sequence $\pi = (a_1, \dots, a_k)$ of actions
used in the expands from $I$ to $G$.

To fit the heuristic pair to a multi-heuristic search, the local FF $h_L$ and
distributed SA Lazy FF $h_D$ can be used in the following way: $h_1 = h_L$, $h_2 = h_D$
and $m = 2$. Thus, by Definition 12, there will be an OPEN list $O_1 = O_L$ for
states evaluated by the local heuristic and an OPEN list $O_2 = O_D$ for the
distributed heuristic.

A  choice  of  an  appropriate  selection  function  select  for  classical  planning 
was  thoroughly  examined  in  [10]  with  a  conclusion  that  the  simple  alternation 
mechanism,  where  the  open  lists  are  chosen  in  turn,  appears  to  be  the  best  one 
(this  mechanism  was  used  in  both  FD  and  LAMA). 
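
The scheme of Definition 12 combined with the alternation mechanism can be
sketched in Python as follows. This is a minimal illustration under stated
assumptions, not the Fast Downward or LAMA implementation: the function name
`multi_heuristic_search`, the encoding of actions as `(name, pre, add, delete)`
tuples and the use of binary heaps as OPEN lists are choices made only for this
example.

```python
import heapq
from itertools import count


def multi_heuristic_search(init, goal, actions, heuristics):
    """Greedy multi-heuristic search with one OPEN list per heuristic.

    init: frozenset of facts, goal: set of facts,
    actions: iterable of (name, pre, add, delete) with set-valued parts
    (hypothetical encoding), heuristics: list of functions state -> number.
    """
    tie = count()  # unique tie-breaker so the heap never compares states
    open_lists = [[] for _ in heuristics]
    for open_list, h in zip(open_lists, heuristics):
        heapq.heappush(open_list, (h(init), next(tie), init, ()))
    closed = set()
    k = 0  # index of the OPEN list chosen by the selection function
    while any(open_lists):
        # select: simple alternation, skipping empty OPEN lists
        while not open_lists[k]:
            k = (k + 1) % len(open_lists)
        _, _, s, plan = heapq.heappop(open_lists[k])
        k = (k + 1) % len(open_lists)
        if s in closed:
            continue
        closed.add(s)
        if goal <= s:
            return list(plan)
        # expand: apply every action whose precondition holds in s
        for name, pre, add, delete in actions:
            if pre <= s:
                t = frozenset((s - delete) | add)
                if t not in closed:
                    # successors go into every OPEN list O_i
                    for open_list, h in zip(open_lists, heuristics):
                        heapq.heappush(
                            open_list, (h(t), next(tie), t, plan + (name,))
                        )
    return None
```

The sketch evaluates each successor with every heuristic, which is the property
the MADLA Search later relaxes.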

3.2.2.  Multiagent  Single-Heuristic  Search 

A  multiagent  variant  of  the  heuristic  search  was  proposed  in  [3,  5]  as  a 
multiagent  optimal  search  and  multiagent  greedy  best-first  search  (MA-BFS) 
respectively.  The  search  in  MADLA  adheres  to  the  latter. 

The main principle of multiagent heuristic search is a straightforward
distribution of the classical heuristic search as illustrated in Figure 1.

Definition 13. Let $\Pi = (P, \{A_i\}_{i=1}^{n}, I, G)$ be a multiagent planning
problem and $h$ a heuristic function for $\Pi$. A multiagent heuristic search is
defined over states $s \in 2^P$ and over each agent's OPEN list $O_{\alpha_i}$ of
states, CLOSED list $C_{\alpha_i}$ of states, agent state expansion function

$$\mathrm{expand}_{\alpha_i}(s) \mapsto \{ s' \mid s' = \mathrm{apply}(s, a),\ a \in A_i \text{ s.t. } \mathrm{pre}(a) \subseteq s \},$$

and a distribution function

$$\mathrm{dist}_{\alpha_i} : 2^P \times A \rightarrow 2^{\mathcal{A}}.$$

In multiagent heuristic search, each agent $\alpha_i$ extracts systematically and
in parallel a state $s = \arg\min_{s \in O_{\alpha_i}} h(s)$ from $O_{\alpha_i}$, adds $s$
into $C_{\alpha_i}$ and expands states $S = \mathrm{expand}_{\alpha_i}(s) \setminus C_{\alpha_i}$,
where $s' \in S$ are added into OPEN lists
$\{ O_{\alpha_j} \mid \alpha_j \in \mathrm{dist}_{\alpha_i}(s, a) \}$, where action $a$ expanded
$s'$. The search begins with $O_{\alpha_i} = \{I\}$ for all agents and terminates if
any agent finds $s$ s.t. $G \subseteq s$. The result is a sequence
$\pi = (a_1, \ldots, a_l)$ of actions used in the expands from $I$ to $G$ across the
agents.

MA-BFS [5] is a multiagent heuristic search by Definition 13. The state
exchange function dist it uses is based on the MA-STRIPS partitioning of the
agent's actions into public actions $A_i^{\mathrm{pub}}$ and internal actions (as
exemplified in Figure 1). From the perspective of each agent $\alpha_i$, the
MA-BFS distribution function is defined as

$$\mathrm{dist}^{\text{MA-BFS}}_{\alpha_i}(s, a) \mapsto
\begin{cases}
\mathcal{A} & a \in A_i^{\mathrm{pub}}, \\
\{\alpha_i\} & \text{otherwise.}
\end{cases}$$

Figure 1: Example of a multiagent heuristic search for two agents $\alpha$ and
$\beta$. In i) both agents expand the initial state by private actions (black
arrows), in ii) agent $\alpha$ expands one state by a public action (green arrow
and circle) and in iii) it is sent to agent $\beta$. In iv) agent $\alpha$ no longer
has an applicable action, but agent $\beta$ expands the received state.


In the MA-BFS implementation, adding an expanded state $s'$ to the OPEN lists
of other agents uses a message broadcast$^3$ initiated by the agent $\alpha_i$.
The broadcast is represented in $\mathrm{dist}^{\text{MA-BFS}}$ by the
$a \in A_i^{\mathrm{pub}}$ case. The other case describes an expansion by an
internal action of the agent $\alpha_i$, therefore the expanded state is added
only into the OPEN list of the agent $\alpha_i$, following the principle of
classical heuristic search.
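
The two cases of the MA-BFS distribution function can be written down directly.
The sketch below is only illustrative: the function name `dist_ma_bfs`, the
dictionary `public_actions` and the choice to include the sender in the
broadcast set are assumptions, not details taken from the MA-BFS
implementation.

```python
def dist_ma_bfs(agent, action, agents, public_actions):
    """Return the set of agents whose OPEN list receives the expanded state.

    agents: iterable of all agent names,
    public_actions: dict mapping an agent name to its set of public actions
    (hypothetical representation).
    """
    if action in public_actions[agent]:
        # public action: the state is broadcast to all agents
        return set(agents)
    # internal action: the state stays with the expanding agent
    return {agent}
```

For two agents `alpha` and `beta`, a public action of `alpha` yields
`{"alpha", "beta"}`, while any other action of `alpha` yields `{"alpha"}`.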


3.2.3. Multiagent Multi-Heuristic Search

A merge of the two previous search schemes combines a multi-heuristic search
(Definition 12) and a multiagent search (Definition 13).

Definition 14. Let $\Pi = (P, \{A_i\}_{i=1}^{n}, I, G)$ be a multiagent planning
problem and $h_1, \ldots, h_m$ heuristic functions for $\Pi$. A multiagent
multi-heuristic search is defined over states $s \in 2^P$ and over each agent's
$\alpha_i$ OPEN lists $O_{(\alpha_i,1)}, \ldots, O_{(\alpha_i,m)}$ of states, CLOSED list
$C_{\alpha_i}$ of states, a selection function

$$\mathrm{select}_{\alpha_i}(O_{(\alpha_i,1)}, \ldots, O_{(\alpha_i,m)}) \mapsto O_{(\alpha_i,k)}$$

where $1 \le k \le m$, a state expansion function

$$\mathrm{expand}_{\alpha_i}(s') \mapsto \{ s \mid s = \mathrm{apply}(s', a),\ a \in A_i \text{ s.t. } \mathrm{pre}(a) \subseteq s' \},$$

and a distribution function

$$\mathrm{dist}_{\alpha_i} : 2^P \times A \rightarrow 2^{\mathcal{A} \times \{1, \ldots, m\}}.$$

In multiagent multi-heuristic search, each agent extracts systematically and in
parallel a state $s = \arg\min_{s \in O_{(\alpha_i,k)}} h_k(s)$ from $O_{(\alpha_i,k)}$
selected by the $\mathrm{select}_{\alpha_i}$ function, adds $s$ into $C_{\alpha_i}$ and
expands $S = \mathrm{expand}_{\alpha_i}(s) \setminus C_{\alpha_i}$, where $s' \in S$ are
added into OPEN lists
$\{ O_{(\alpha,k)} \mid (\alpha, k) \in \mathrm{dist}_{\alpha_i}(s, a) \}$, where $s'$ was
expanded by $a$. The search begins with $O_{(\alpha,k)} = \{I\}$ for all agents and
all OPEN lists and terminates if any agent finds $s$ s.t. $G \subseteq s$. The
result is a sequence $\pi = (a_1, \ldots, a_l)$ of actions used in the expands from
$I$ to $G$.

3 As long as an agent $\alpha_i$ does not want to share private facts
$P_s^{i} = \{ p \mid p \in s \text{ s.t. } p \in P_i^{\mathrm{int}} \}$ of a broadcasted
state $s$, it can obfuscate the facts in $P_s^{i}$ (as proposed in [14]) or
replace the facts in set $P_s^{i}$ by a private unique identifier (or a hash
value) known only to $\alpha_i$, as this private part of the state cannot be
modified by other agents. If a modified state amended by such an identifier
returns by one of the later broadcasts back to the agent $\alpha_i$, the agent can
use the identifier and add the private facts $P_s^{i}$ back as if they were
always part of the state.

As a baseline instantiation of the scheme we will use the alternation of the
local FF and SA Lazy FF OPEN lists and the MA-BFS distribution function based
on MA-STRIPS factorization. Such an approach can be considered only a baseline,
as the efficiency improvements of the multi-heuristic search (as e.g., in LAMA)
were achieved thanks to the use of very different and in a sense "orthogonal"
heuristics and cannot be anticipated when two variants of the same FF heuristic
are used. However, the idea of an OPEN list for each heuristic can be modified
to bring improvement in multiagent heuristic search, as we will show in further
sections.

This schema is also a template for the MADLA Search, which is its
instantiation with the local FF and SA Lazy FF heuristics and with the specific
selection and distribution functions described in the following section.

3.2.4. MADLA Search: Multiagent Distributed and Local Asynchronous Search

The search used in the MADLA planner is a multiagent multi-heuristic
search driven by a distributed and a local heuristic. As we explain in the
following paragraphs, the MADLA Search relies on the properties of the MADLA
heuristic pair described in Definition 7, which were shown to hold for the local
FF heuristic and the SA Lazy FF heuristic in Proposition 11.

Similarly to the instantiation shown for the multi-heuristic search by
Definition 12, the search schema is instantiated such that $h_1 = h_L$,
$h_2 = h_D$ and $m = 2$, where $h_L$ is the local FF and $h_D$ the SA Lazy FF
heuristic. Following Definition 14, the search uses two OPEN lists
$O_{(\alpha_i,L)}$, $O_{(\alpha_i,D)}$ for each agent $\alpha_i$. The selection function
prioritizes expansion of states in the OPEN list related to the distributed
heuristic, if the heuristic is not in the process of computing a heuristic
estimate (denoted as $\mathit{busy}_D$). In the case of the SA Lazy FF heuristic,
$\mathit{busy}_D$ means that the heuristic is waiting for some replies. The MADLA
selection function:

$$\mathrm{select}^{\mathrm{MADLA}}_{\alpha_i}(O_{(\alpha_i,L)}, O_{(\alpha_i,D)}) \mapsto
\begin{cases}
O_{(\alpha_i,D)} & \neg\mathit{busy}_D \wedge O_{(\alpha_i,D)} \neq \emptyset, \\
O_{(\alpha_i,L)} & \neg\mathit{busy}_D \wedge O_{(\alpha_i,D)} = \emptyset, \\
O_{(\alpha_i,L)} & \mathit{busy}_D.
\end{cases}$$

The  main  loop  of  the  MADLA  Search  is  listed  in  Algorithm  1.  The  algorithm 
follows  the  search  by  Definition  14. 

Unlike the classical multi-heuristic search (Section 3.2.1), in the MADLA
search the extracted states are not evaluated by both heuristics. Instead, the
heuristic evaluator used depends on the state of the SA Lazy FF heuristic
$h_D$. If $\neg\mathit{busy}_D$, the state is evaluated by $h_D$. If
$\mathit{busy}_D$, the state is evaluated by $h_L$; that is, the distributed
heuristic is always preferred. This approach is most reasonable if $h_D$
dominates $h_L$, which holds for the local FF and SA Lazy FF heuristics
(Proposition 11).
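
The selection function and the choice of the evaluating heuristic can be
summarized in a few lines of Python. This is a hedged sketch: the function name
`select_madla`, the string labels and the `None` return value for the case in
which no state can be extracted are assumptions introduced for the
illustration.

```python
def select_madla(open_dist, open_local, busy_d):
    """Pick the OPEN list to extract from and the heuristic evaluating the state.

    Returns a pair (list_label, evaluator_label), or None when there is
    nothing to extract. Labels are hypothetical: "D"/"L" for the distributed
    and local OPEN lists, "h_D"/"h_L" for the two heuristics.
    """
    if not busy_d:
        if open_dist:
            return ("D", "h_D")
        if open_local:
            # O_D is empty: a state is pulled from O_L but evaluated by h_D
            return ("L", "h_D")
        return None
    if open_local:
        # the distributed heuristic is busy: the local search proceeds
        return ("L", "h_L")
    return None
```

The distributed list is used whenever the distributed heuristic is free and the
list is non-empty; the local heuristic only fills the waiting time.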

Additionally, the local heuristic search is performed only when the
distributed heuristic search is waiting for the distributed heuristic estimation
to finish (see line 5 in Algorithm 1). This principle makes sense only if
finishing an estimation of $h_D$ takes longer than an estimation of $h_L$ and if
the computation of the $h_D$ estimator does not block the search process (incl.
$h_L$ estimations). These two requirements hold for the local FF and SA Lazy FF
by Proposition 11 as well.

Separation of the searches (Algorithms 2, 3, and 4) has the benefit of using
two heuristics in parallel, but if some information could be shared between the
two searches$^4$, most importantly the heuristically best state found so far,
it could boost the efficiency of the planner. The direction
$O_{(\alpha_i,D)} \rightarrow O_{(\alpha_i,L)}$ is straightforward, thanks to the fact that
$h_D$ dominates $h_L$. We can add all nodes evaluated by $h_D$ also to
$O_{(\alpha_i,L)}$ without ever skipping a better state evaluated by $h_L$ in favor
of a worse state evaluated by $h_D$.
The state distribution function covers this direction:

$$\mathrm{dist}^{\mathrm{MADLA}}_{\alpha_i}(s, a) \mapsto
\begin{cases}
\mathcal{A} \times \{L\} & a \in A_i^{\mathrm{pub}} \wedge s \in O_{(\alpha_i,L)}, \\
\{(\alpha_i, L)\} & a \in A_i^{\mathrm{int}} \wedge s \in O_{(\alpha_i,L)}, \\
\mathcal{A} \times \{D\} \cup \mathcal{A} \times \{L\} & a \in A_i^{\mathrm{pub}} \wedge s \in O_{(\alpha_i,D)}, \\
\{(\alpha_i, D), (\alpha_i, L)\} & \text{otherwise.}
\end{cases}$$
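
The four cases of the MADLA distribution function translate into a small
lookup. The sketch below is an illustration only; the function name
`dist_madla`, the boolean flag `action_is_public` and the labels `"D"` and
`"L"` are assumptions, and the set of pairs it returns stands for the OPEN
lists that receive the successor state.

```python
def dist_madla(agent, action_is_public, from_list, agents):
    """Return the set of (agent, list_label) pairs receiving the successor.

    from_list is "L" or "D", the OPEN list the expanded state was taken from
    (hypothetical labels); agents is the collection of all agents.
    """
    if from_list == "L":
        if action_is_public:
            # local search, public action: local lists of all agents
            return {(a, "L") for a in agents}
        # local search, internal action: own local list only
        return {(agent, "L")}
    if action_is_public:
        # distributed search, public action: both lists of all agents
        return {(a, label) for a in agents for label in ("D", "L")}
    # distributed search, internal action: both own lists
    return {(agent, "D"), (agent, "L")}
```

States produced by the distributed search always reach the local lists as well,
which is the direction from the distributed to the local OPEN list discussed
above; states produced by the local search never enter a distributed list.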


The other direction $O_{(\alpha_i,L)} \rightarrow O_{(\alpha_i,D)}$ is trickier. If we added
a node $s$ evaluated by $h_L$ to $O_{(\alpha_i,D)}$, it would skip many nodes which
are actually closer to the goal only because the local, less informative
heuristic gives a lower estimate. The way at least some information can be
shared in this direction is the following: whenever the OPEN list
$O_{(\alpha_i,D)}$ becomes empty and the heuristic estimator $h_D$ is not computing
any heuristic, a state $s$ is pulled from the local OPEN list $O_{(\alpha_i,L)}$
and evaluated by the distributed heuristic $h_D$ and its successors are


4 The two searches both run on one agent, therefore the question of privacy is
irrelevant here.

Algorithm 1: MADLA Search for agent $\alpha_i$ and multiagent planning problem
$\Pi$ following Definition 14 of a multiagent multi-heuristic search.

 1  Algorithm MADLA-Search
 2    $O_{(\alpha_i,D)} \leftarrow \{I\}$; $O_{(\alpha_i,L)} \leftarrow \{I\}$; $C_{\alpha_i} \leftarrow \emptyset$;  // search begin
 3    while solution not found do
 4      processComm();  // see Algorithm 4
 5      if $\neg\mathit{busy}_D$ then
 6        // $\mathrm{select}^{\mathrm{MADLA}}_{\alpha_i}$
 7        if $O_{(\alpha_i,D)} \neq \emptyset$ then
 8          $s \leftarrow \arg\min_{s \in O_{(\alpha_i,D)}} h_D(s)$;
 9        else if $O_{(\alpha_i,L)} \neq \emptyset$ then
10          $s \leftarrow \arg\min_{s \in O_{(\alpha_i,L)}} h_L(s)$;
11        else
12          continue;
13        processState(s, true);
14      while ($O_{(\alpha_i,D)} = \emptyset$ or $\mathit{busy}_D$) and $O_{(\alpha_i,L)} \neq \emptyset$ do
15        $s \leftarrow \arg\min_{s \in O_{(\alpha_i,L)}} h_L(s)$;
16        processState(s, false);
17        processComm();  // see Algorithm 4
18  Procedure processState(s, d)
19    if $s \notin C_{\alpha_i}$ then
20      $C_{\alpha_i} \leftarrow C_{\alpha_i} \cup \{s\}$;  // close state
21      if $G \subseteq s$ then
22        reconstructPlan(s, $\emptyset$);  // search end, see Algorithm 5
23      if d = true then
24        expandDistributed(s);  // see Algorithm 3
25      else
26        expandLocal(s);  // see Algorithm 2


Algorithm 2: expandLocal(s)

 1  Procedure expandLocal(s)
 2    $h_s \leftarrow h_L(s)$;
 3    if s reached by public action then
 4      send STATE(s, d = false);
 5    $E \leftarrow \mathrm{expand}(s)$;  // $\mathrm{expand}_{\alpha_i}$
 6    // $\mathrm{dist}^{\mathrm{MADLA}}_{\alpha_i}$
 7    $O_L \leftarrow O_L \cup E$;

Algorithm 3: expandDistributed(s)

 1  Procedure expandDistributed(s)
 2    $h_D(s, \mathrm{callback}(h))$;
 3  handleCallback callback(h)
 4    $h_s \leftarrow h$;
 5    if s reached by public action then
 6      send STATE(s, d = true);
 7    $E \leftarrow \mathrm{expand}(s)$;  // $\mathrm{expand}_{\alpha_i}$
 8    // $\mathrm{dist}^{\mathrm{MADLA}}_{\alpha_i}$
 9    $O_D \leftarrow O_D \cup E$;
10    $O_L \leftarrow O_L \cup E$;


Algorithm 4: processComm()

 1  Procedure processComm()
 2    for each message M in message queue do
 3      switch M do
 4        // $\mathrm{dist}^{\mathrm{MADLA}}_{\alpha_i}$
 5        case M = STATE(s, false)
 6          $O_L \leftarrow O_L \cup \{s\}$;
 7        case M = STATE(s, true)
 8          $O_D \leftarrow O_D \cup \{s\}$;
 9          $O_L \leftarrow O_L \cup \{s\}$;
10        case M = RECONSTRUCT(s, P)
11          reconstructPlan(s, P);

[Figure 2 labels: $h_D$, distributed heuristic estimator (distributed expand,
used if $h_D$ is not busy); $h_L$, local heuristic estimator (local expand, used
if $h_D$ is busy); $O_D$, distributed OPEN list; $O_L$, local OPEN list.]

Figure 2: Distributed/Local Search - OPEN lists and heuristic estimators.


added to both OPEN lists, as already described. This way, the nodes in
$O_{(\alpha_i,D)}$ are evaluated only by $h_D$, but sometimes, the best node from
$O_{(\alpha_i,L)}$ is taken. The situation is illustrated in Figure 2.

The implementation of the distribution in the proposed search follows the
principles of MA-BFS, i.e., broadcasts are used to inform other agents about
states reached by public actions. Additionally, each message includes the
information about which OPEN list the state should be added to (the local or
the distributed one). We use deferred heuristic evaluation, a technique used in
the LAMA planner to increase the efficiency of planning with computationally
heavy heuristics.

In order to end with one sequence of actions $\pi$ across all the agents by
Definition 14, the reconstruction has to be a distributed process. Algorithm 5
lists such a procedure, and with it we close the description of the MADLA
Search and the MADLA planner.
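
The distributed reconstruction of Algorithm 5 can be simulated in a single
process to see how the walk hops between agents. The sketch is a simplification
under stated assumptions: the function name `reconstruct_plan`, the per-agent
tables and the `"local"`/`"received"` tags are hypothetical, and the
RECONSTRUCT message is modeled by simply switching the current agent instead of
sending anything over a network.

```python
def reconstruct_plan(tables, agent, state, init):
    """Follow parent links back to the initial state across several agents.

    tables[agent][state] is ("local", parent_state) if the agent generated
    the state itself, or ("received", sender) if the state arrived in a STATE
    message from `sender` (hypothetical bookkeeping).
    Returns the path of states from init to the given state.
    """
    path = []
    while True:
        if not path or path[-1] != state:
            path.append(state)          # P <- P u {s}
        if state == init:
            return path[::-1]           # solution found
        kind, ref = tables[agent][state]
        if kind == "local":
            state = ref                 # continue with parent(s)
        else:
            agent = ref                 # RECONSTRUCT(s, P) goes to the sender
```

In a two-agent example where `beta` received `s1` from `alpha` and expanded it
to `s2`, the walk starts at `beta`, moves to `alpha` at `s1` and ends in the
initial state.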

In the following section we prove that the resulting sequence $\pi$ is a sound
multiagent plan solving the input multiagent planning problem $\Pi$ and that the
MADLA planner is complete.


Algorithm 5: reconstructPlan(s, P)

 1  Procedure reconstructPlan(s, P)
 2    $P \leftarrow P \cup \{s\}$;
 3    if $s = I$ then
 4      solution found P;
 5    if parent(s) is local then
 6      reconstructPlan(parent(s), P);
 7    else
 8      send RECONSTRUCT(s, P) to agent from which s was received

3.3. Soundness and Completeness

Before presenting the proof itself, the necessary assumptions will be stated:

Assumption 1. Liveness of the communication, i.e., every message will
eventually arrive at its destination.

Assumption 2. The distributed heuristic is independent of the search, i.e., it
is asynchronous and the messages never interfere.

Assumption 3. The distributed heuristic always terminates, i.e., for each state,
the distributed heuristic will eventually return the heuristic estimate.


Moreover, in the first part of the proof, we will assume that the search is
performed by a single agent (although the heuristic still satisfies Assumptions
2 and 3) on a classical STRIPS problem. The single-agent version of the
algorithm will be denoted as SADLA Search. There is no communication in
SADLA Search, therefore the processComm() procedure is ignored as well as
all send commands. Later on, the proof will be extended to include multiple
agents.
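
To make the single-agent variant concrete, the following Python sketch runs a
SADLA-like search in one loop. It is a simplified rendering, not the planner's
code: the asynchronous distributed heuristic is simulated by a `delay` counted
in loop iterations, the two loops of Algorithm 1 are merged into one, successors
are queued with the heuristic value of their parent (deferred evaluation), and
all names (`sadla_search`, the `(name, pre, add, delete)` action tuples) are
assumptions made for the example.

```python
import heapq
from itertools import count


def sadla_search(init, goal, actions, h_local, h_dist, delay=2):
    """Single-agent sketch of the distributed/local search.

    init: frozenset of facts, goal: set of facts, actions: iterable of
    (name, pre, add, delete); h_dist answers only after `delay` iterations.
    """
    tie = count()
    open_d, open_l = [], []
    heapq.heappush(open_d, (0, next(tie), init))
    heapq.heappush(open_l, (0, next(tie), init))
    closed, parent = set(), {init: None}
    pending = None  # (ticks_left, state) while h_D is busy

    def expand(s):
        successors = []
        for name, pre, add, delete in actions:
            if pre <= s:
                t = frozenset((s - delete) | add)
                parent.setdefault(t, (s, name))  # remember first parent only
                successors.append(t)
        return successors

    def plan_to(s):
        plan = []
        while parent[s] is not None:
            s, name = parent[s]
            plan.append(name)
        return plan[::-1]

    while open_d or open_l or pending:
        if pending:  # simulated callback of the distributed heuristic
            ticks, s = pending
            if ticks == 0:
                pending = None
                hs = h_dist(s)
                for t in expand(s):
                    if t not in closed:
                        heapq.heappush(open_d, (hs, next(tie), t))
                        heapq.heappush(open_l, (hs, next(tie), t))
            else:
                pending = (ticks - 1, s)
        busy = pending is not None
        if not busy and open_d:
            s, distributed = heapq.heappop(open_d)[2], True
        elif open_l:
            s, distributed = heapq.heappop(open_l)[2], not busy
        else:
            continue
        if s in closed:
            continue
        closed.add(s)  # close state
        if goal <= s:
            return plan_to(s)
        if distributed:
            pending = (delay, s)  # request h_D; successors arrive later
        else:
            hs = h_local(s)
            for t in expand(s):
                if t not in closed:
                    heapq.heappush(open_l, (hs, next(tie), t))
    return None
```

When both OPEN lists are empty and no heuristic request is pending, the loop
ends without a plan, which mirrors the termination condition used for the
SADLA+ variant in the completeness argument.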

3.3.1.  Soundness 

To show soundness of the SADLA Search, we will use the fact that for
each search state $s$ we know its direct predecessor parent(s). Using procedure
reconstructPlan() shown in Algorithm 5, the sequence of states
$(s_0, s_1, \ldots, s_k)$, where for each $0 < i \le k$, $s_{i-1} = \mathrm{parent}(s_i)$,
$s_0 = I$ and $s_k = s$, can always be reconstructed. The action $a_i$ which
transformed state $s_{i-1}$ to state $s_i$ will be denoted as
$a_i = \mathrm{trans}(s_{i-1}, s_i)$.

Definition 15. A path is a sequence of states $(s_0, s_1, \ldots, s_k)$. A path is
valid iff $s_0 = I$ is the initial state and for each $0 < i \le k$:
$s_{i-1} = \mathrm{parent}(s_i)$ and $a_i = \mathrm{trans}(s_{i-1}, s_i)$ is applicable in
$s_{i-1}$. We say that $s_k$ is the resulting state and $\mathrm{path}(s_k)$ is the
respective path (the sequence of states expanded during the search). The path
is a valid solution if $s_k$ is a goal state, that is $G \subseteq s_k$.

To  prove  the  soundness,  the  following  lemma  will  be  shown  first: 

Lemma 16. Invariant: At any given step of the SADLA Search, for any state
$s_L \in O_L$ and any state $s_D \in O_D$, $\mathrm{path}(s_L)$ and $\mathrm{path}(s_D)$ are
valid paths.

Proof. In the initial step, $O_L = O_D = \{I\}$, where $\mathrm{path}(I) = (I)$, which
is a valid path. The only procedures where new states are added to any of the
open lists are expandLocal(s) and expandDistributed(s) (note that in SADLA
Search, procedure processComm() is ignored). In both procedures, the states
added are created using expand(s), which creates a new state $s'$ for each
action $a$ applicable in $s$ such that $a = \mathrm{trans}(s, s')$ and
$s = \mathrm{parent}(s')$. If we assume that $\mathrm{path}(s) = (I, \ldots, s)$ is a valid
path, then after expansion, $\mathrm{path}(s') = (I, \ldots, s, s')$ is also valid for
each new $s'$. □

Theorem 17. When the SADLA Search terminates and returns a solution, it
is a valid solution.

Proof. The algorithm terminates on line 4 of Algorithm 5. Since we assume a
single agent, the procedure is a simple recursion and thanks to the condition on
line 3 of Algorithm 5, upon termination, the last state added to the returned
solution $P$ is the initial state $I$. Procedure reconstructPlan() is initially
called only from line 22 of Algorithm 1, where $s$ was extracted from $O_L$ or
$O_D$. From Lemma 16, and because a state is added to $P$ only if it is the
parent of the previously added state, it follows that $P$ is a valid path, i.e.,
$P = \mathrm{path}(s)$. Also, on line 22 of Algorithm 1, $G \subseteq s$ always holds,
therefore $P = \mathrm{path}(s)$ is a valid solution. □
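
The conditions of Definition 15 are mechanical enough to be checked by a short
routine. The following Python sketch is only an illustration; the function
names and the dictionaries `parent`, `trans` and `pre` (mapping states to
parents, state pairs to action names and action names to precondition sets) are
hypothetical representations chosen for the example.

```python
def is_valid_path(path, init, parent, trans, pre):
    """Check Definition 15: s_0 = I, consecutive states are linked by parent(),
    and the connecting action is applicable in the earlier state."""
    if not path or path[0] != init:
        return False
    for prev, cur in zip(path, path[1:]):
        if parent.get(cur) != prev:
            return False
        action = trans.get((prev, cur))
        if action is None or not pre[action] <= prev:
            return False
    return True


def is_valid_solution(path, init, goal, parent, trans, pre):
    """A valid path is a valid solution if its last state satisfies the goal."""
    return is_valid_path(path, init, parent, trans, pre) and goal <= path[-1]
```

A checker of this kind is useful as a test oracle for a search implementation,
since Theorem 17 states that every returned solution must pass it.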

3.3.2.  Completeness 

To show completeness, the algorithm will be further modified in order to
show that any reachable state can be reached by the algorithm. The modified
algorithm will be denoted SADLA+ Search, in which the reconstructPlan()
procedure in Algorithm 5 is ignored and the algorithm instead terminates
when both $O_L$ and $O_D$ are empty.

Lemma 18. Each state is added to $O_D$ and to $O_L$ at most finitely many times.

Proof. Because the number of possible states is finite ($2^{|P|}$ since
$s \subseteq P$) and the number of actions is also finite, each call to expand()
produces a finite number of states that are subsequently added to $O_D$, $O_L$
or both (line 5 of Algorithm 2 and line 7 of Algorithm 3). If a state is
extracted from $O_L$ or $O_D$, it is added to the closed list $C$ and never
added to any of the open lists again. □


Lemma 19. Each state in $O_L$ and each state in $O_D$ is eventually extracted.

Proof. In each step of the outer while cycle of Algorithm 1, a state is extracted
from $O_D$, if $h_D$ is not busy. Since $h_D$ always terminates (Assumption 3),
the number of extracted states is not limited. From Lemma 18 it follows that
only a finite number of states may be added to $O_D$, therefore $O_D$ eventually
becomes empty. When $O_D$ is empty, in each step of the inner while cycle a
state is extracted from $O_L$. Following the same reasoning as before, $O_L$
eventually becomes empty as well. □

Theorem  20.  The  SADLA+  Search  terminates. 

Proof.  Follows  directly  from  Lemma  18  and  Lemma  19.  □ 

Definition 21. A state $s$ is reachable if a valid path
$\mathrm{path}(s) = (I, \ldots, s)$ exists.

Lemma 22. If a state $s$ is reachable, it is placed into the closed list $C$
after a finite number of steps.

Proof. Assume that $s$ is reachable, but is never placed in $C$. Since $s$ is
reachable, $\mathrm{path}(s) = (I, \ldots, s)$. Let $s_i$ be the first state in
$\mathrm{path}(s)$ that is not added to $C$. Note that there exists an action $a$
such that $s_i = \mathrm{apply}(s_{i-1}, a)$. Since at some point $s_{i-1} \in C$,
$s_{i-1}$ must have been taken from $O_L$ or $O_D$. At that point, $s_{i-1}$ was
also expanded and because $a$ is applicable in $s_{i-1}$, it must have been
applied. The resulting state $s_i = \mathrm{apply}(s_{i-1}, a)$ was added to either
$O_L$ or $O_D$ and, because of Lemma 19, eventually extracted and added to $C$.
This contradicts the assumption that $s_i$ is not added to $C$. □

Theorem 23. The SADLA Search is complete.

Proof. In SADLA+ Search, every reachable state $s$ such that $G \subseteq s$ is
eventually placed into $C$ (Lemma 22). Let $s_0$ be the first such state. In
SADLA Search, after adding it to the closed list $C$, it is passed to the
reconstructPlan() procedure of Algorithm 5 and $\mathrm{path}(s_0)$ is reported as a
solution. □

3.3.3.  Extending  the  Proof  to  Multiple  Agents 

The soundness and completeness shown for the SADLA Search hold also for
the MADLA Search, that is, communicating states reached by a public action
does not break the proven lemmas and theorems. Structures belonging to agent
$\alpha_i$ will be denoted by a superscript.

Lemma 24. Sending state $s$ from agent $\alpha_i$ to agent $\alpha_j$ for any
$1 \le i, j \le n$ does not violate the invariant of Lemma 16.

Proof. If $\mathrm{path}(s)$ is valid before sending state $s$ by agent $\alpha_i$, it is
also valid after receiving state $s$ by agent $\alpha_j$ (no state is added or
removed). When state $s$ is sent by $\alpha_i$, the invariant holds, because one of
the following holds:

(i) $\mathrm{path}(s)$ contains no state previously received from another agent.
Before sending $s$, $s' = \mathrm{parent}(s)$ was extracted from either
$O_L^{\alpha_i}$ or $O_D^{\alpha_i}$, therefore from Lemma 16 it follows that the
invariant holds for $s'$. By expanding a state, the invariant is not
violated, therefore the invariant holds for $s$ as well.

(ii) $\mathrm{path}(s)$ contains some states received from other agents. Let $s'$
be the first such state. Because of (i) and because sending does not
violate the invariant, the invariant holds also for $s$. □

Lemma 25. Procedure reconstructPlan() of Algorithm 5 reconstructs a valid
plan even if multiple agents are involved.

Proof. In each call of the procedure, either parent(s) is added (if it is local),
or the RECONSTRUCT(s, P) message is sent. When received (each message is
eventually delivered thanks to Assumption 1), the reconstructPlan() procedure
is called again by the receiving agent. Since $\mathrm{path}(s)$ is valid, the
process will eventually reach the initial state $I$ and terminate. □

Theorem 26. The MADLA Search is sound.

Proof. Lemma 16 holds for MADLA Search because of Lemma 24. Soundness
of MADLA Search follows from the proof of soundness of SADLA Search (Theorem
17) and Lemma 25. □

Lemma 27. Each state is received by any agent at most finitely many times.

Proof. A state $s$ is sent by agent $\alpha_j$ only if another state $s'$ was
extracted from either $O_L^{\alpha_j}$ or $O_D^{\alpha_j}$ and a public action
$a \in A_j^{\mathrm{pub}}$ was applied. Since there is a finite number of public
actions and a finite number of agents, each action is applicable only in a
finite number of states (which are then placed into $C$ and never expanded
again), therefore state $s$ can be sent and received only a finite number of
times. □

Let MADLA+ Search denote a modification of MADLA Search such that the
reconstructPlan() procedure in Algorithm 5 is ignored and the algorithm
terminates when $O_L^{\alpha_i}$ and $O_D^{\alpha_i}$ are empty for all
$\alpha_i \in \mathcal{A}$.

Lemma 28. The MADLA+ Search terminates.

Proof. Lemma 18 holds also for MADLA+ Search, because the only additional
situation in which a state is added to any open list is when it is received from
another agent, which can happen at most finitely many times (Lemma 27).
Therefore Lemma 19 also holds and the termination follows directly. □

In order to continue with the proof of completeness, we need to redefine
reachability for multiple agents.

Definition 29. A state $s$ is reachable by a sequence of agents
$A = (\alpha_0, \ldots, \alpha_m)$ of length $m$ (agents in $A$ can repeat) if a sequence
of states $\pi = (s_0^{\alpha_0}, \ldots, s_l^{\alpha_m})$ exists such that $s_0^{\alpha_0}$ is
the initial state $I$ and was reached by agent $\alpha_0$, $s_l^{\alpha_m} = s$ and was
reached by agent $\alpha_m$, and $\pi$ is a valid path, i.e., $\pi = \mathrm{path}(s)$.

Lemma 30. If a state $s$ is reachable by a sequence of agents
$A = (\alpha_0, \ldots, \alpha_k)$ of length $k$, it is placed into the closed list $C$
after a finite number of steps.

Proof. If a state $s$ is reachable by a sequence containing a single agent
$A = (\alpha_0)$, the lemma holds trivially. Let us now assume that for all
$k \le m$, if a state is reachable by a sequence of agents
$A = (\alpha_0, \ldots, \alpha_k)$ of length $k$, it is added to $C$ after finitely many
steps. We will show that the same holds if a state is reachable by a sequence
of agents $A' = (\alpha_0, \ldots, \alpha_m, \alpha_{m+1})$ of length $m + 1$. We show the
induction step by contradiction. Assume that $s$ is a state reachable by a
sequence of agents $A'$ of length $m + 1$, but never added to $C$. Let
$\mathrm{path}(s) = (s_0, \ldots, s)$ and let $s_i$ be the first state that is never
added to $C$. One of the following holds:

(i) $s_i$ is reachable by agent $\alpha_{m+1}$. If so, the same reasoning used in
the proof of Lemma 22 can be used to obtain a contradiction.

(ii) $s_i$ is reachable by some $k$ agents $A = (\alpha_0, \ldots, \alpha_k)$; then we
have a contradiction with the assumption of the induction. □


Theorem 31. The MADLA Search is complete.

Proof. Similarly to the single-agent version, in MADLA+ Search, every
reachable state $s$ such that $G \subseteq s$ is eventually placed into $C$
(Lemma 30). Let $s_0$ be the first such state. In MADLA Search, after adding it
to the closed list $C$, it is passed to the reconstructPlan() procedure, which
reports a valid solution $P = \mathrm{path}(s_0)$, even if multiple agents are
involved (Lemma 25). □

4.  Evaluation  and  Discussion 

After the formal verification of the proposed MADLA search, we analyze its
practical properties in experimental evaluation. First, we compare the building
blocks of the planner, namely projected FF and distributed SA Lazy FF
heuristics in multiagent single-heuristic search, the pair of the heuristics in
multiagent multi-heuristic search, and finally the MADLA search. Second, we
discuss in detail properties of the planner on particular planning domains and
on various metrics. Finally, we compare the planner with other distributed
multiagent planners on a multiagent benchmark set.

In the classical planning literature, the typical way to compare planners
or heuristics is to run them on a set of benchmark domains, each having a
number of problem instances. The domains (or problems) vary in many
properties, such as size, combinatorial hardness, structural traits and others.
The same methodology was adopted in multiagent planning. A notable difference
is in the set of benchmark domains, which is much less "standardized" in
multiagent planning than in classical planning, as there is no track at the
International Planning Competition (IPC) [17] for multiagent planning yet.
Therefore the set of benchmarks we are using in this article is a union of the
domains used for comparison of the four planners we are comparing MADLA to in
the last section. All these domains and problems originate in single-agent
domains and problems from IPC. Since the IPC domains are designed both to
examine the efficiency of the planners against various (theoretically)
interesting properties and to relate to real-world problems, the same is
anticipated in the case of multiagent planners. Currently there are not many
properties specific only to multiagent planners (as most of the multiagent
planners are variations on distribution of principles used in classical
planning), therefore no special multiagent domains and problems are usually
used. The most notable property, studied already by [2], is the coupling of the
agents in the problems, which correlates (to some extent) with the ratio of
private and public actions in the planning problems. In the detailed analysis
section, we will discuss the influence of this and other more subtle traits of
the problems on the efficiency of the used search and heuristics.

As the core comparison metric, we will use the coverage of problems solved
within a 20-minute time limit and an 8 GB memory limit. In the detailed
analysis, we will use the number of expanded states (and their ratios for the
used heuristics), the computation time of the particular heuristics and the
number of public actions requiring a (private) action supporter for a private
precondition fact.

The MADLA planner is a distributed multiagent system, therefore the
non-determinism of the planning process is inherent and cannot be removed by
synchronization under the assumption of unknown ordering in message delivery
from two different agents to one recipient. Considering this assumption, every
measurement was repeated 10 times$^5$ and the results were averaged. The
experiments were run on a homogeneous cluster of machines, each dedicated to
one run of the planner. Each machine was equipped with 8 hyper-threading i7
cores (i.e., 16 threads) at 2.6 GHz. Each agent was running on two threads: one
receiving messages and filling in content data structures in appropriate
collections (e.g., states into the OPEN lists $O_{(\alpha_i,L)}$, $O_{(\alpha_i,D)}$)
and the other searching and evaluating the heuristics.
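
As a small illustration of the core metric, the sketch below computes the
coverage averaged over repeated runs. It rests on assumptions: the record
layout (keys `rep`, `solved`, `time_s`, `mem_gb`) and the function name
`average_coverage` are hypothetical, and averaging the per-repetition coverage
is only one plausible reading of how the repeated measurements were combined.

```python
def average_coverage(runs, time_limit_s=20 * 60, mem_limit_gb=8.0):
    """Mean number of problems solved within the limits, over repetitions.

    Each run is a dict with hypothetical keys:
    'rep' (repetition index), 'solved' (bool), 'time_s', 'mem_gb'.
    """
    solved_per_rep = {}
    for r in runs:
        ok = (
            r["solved"]
            and r["time_s"] <= time_limit_s
            and r["mem_gb"] <= mem_limit_gb
        )
        solved_per_rep[r["rep"]] = solved_per_rep.get(r["rep"], 0) + int(ok)
    if not solved_per_rep:
        return 0.0
    return sum(solved_per_rep.values()) / len(solved_per_rep)
```

A run that finds a plan only after the time limit, or above the memory limit,
counts as unsolved for its repetition.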

4.1. Comparison of the Building Blocks

The building blocks of the MADLA planner are the projected and distributed
FF heuristics and the scheme for combining them in search. A baseline
approach (as we presented in Section 3.2.1) is to adapt classical
multi-heuristic (MH) search for multiagent planning. However, this approach is
not viable as the heuristics are not "orthogonal". The proposed MADLA search
utilizes the requirement for a non-blocking distributed heuristic estimator
(particularly implemented in the form of the SA Lazy FF heuristic) by running
projected FF in the spare time. Table 1 summarizes the coverage of the FF and
SA Lazy FF heuristics in separate multiagent single-heuristic search (Section
3.2.2), a multiagent multi-heuristic search (Section 3.2.3) and the MADLA search
(Section 3.2.4) with the projected FF and distributed SA Lazy FF heuristics.

The results clearly indicate that the multi-heuristic approach is not suitable
for the pair of projected and distributed FF heuristics. The total coverage
is similar for the two single-heuristic searches, in line with the results of
our previous work [5]. The price for the better estimates of the distributed
variant of global FF is slower computation of the estimate. Driverlog,
elevators08, woodworking08 and zenotravel are domains where one of the
heuristics performs clearly better than the other. In cases where the FF
heuristic is not appropriate, or where the overhead of its distributed
computation outweighs the fact that the heuristic is more informative, the
search using only projected FF performs better.

In MADLA search, the distributed search is prioritized over the local
projected search. Provided that the implementation is ideal, the coverage of
the MADLA search should always be equal to or better than the coverage of a
single-heuristic search with the distributed SA Lazy FF heuristic. As the
results show, this property holds with one notable exception: the
woodworking08 domain. Since the current implementation does not interrupt the
local estimation process of projected FF, the distributed estimator can be
blocked for a time proportional to the hardness of the computation of the
local FF. This phenomenon manifests itself in the (computationally hard)
woodworking08 problems and implies a degradation even below the result of the
single-heuristic SA Lazy FF search. This is not the case for MH, as it
estimates all states with both heuristics and alternates which OPEN list is
used for expansion.

domain               |A|     FF      SA Lazy FF   MH search   MADLA search
blocksworld (35)     4       31.2    32.8         15          34.2
depot (20)           5-12    10.6    9.2          7           12.4
driverlog (20)       2-8     17.3    14           13          15.8
elevators08 (30)     4-5     17.3    28.1         2           29.7
logistics00 (20)     3-7     19.9    20           3           20
openstacks (30)      2       17.2    17.1         15.5        17.5
rovers (20)          1-8     20      19.9         6.6         20
satellites (20)      1-5     20      20           6           20
woodworking08 (30)   7       1.8     5            5.6         4.5
zenotravel (20)      1-5     19      14.1         8           19
total (245)                  174.3   180.2        81.7        192.9

Table 1: Coverage of the building blocks. The number of problems in a domain
is given in brackets; |A| denotes the number (interval) of agents in the
problems.

5 Although 10 samples are not enough for reasonable statistical confidence,
they help to identify cases with extreme variance. Such a phenomenon did not
manifest in any of our experiments, therefore we concluded that the planner is
deterministic enough for the metrics used. A deeper analysis of the
statistical properties of a multiagent (multi-heuristic) search is an
interesting topic left for future work.

The only other case where the MADLA search does not match or outperform both
single-heuristic searches is driverlog. In this case, the explanation is
straightforward: the prioritized SA Lazy FF heuristic is not appropriate for
the particular problems, and the projected version is faster because of its
relative computational ease, even though it is less informed.

With the exception of woodworking08 and driverlog, the proposed search scheme
always matches or improves the coverage of both single-heuristic searches and,
in total coverage, more than doubles the performance of the classical
multi-heuristic scheme with the same heuristics.
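
The two claims above can be checked directly against Table 1. The following sketch copies the per-domain coverage values and the reported totals from the table; the helper name `below_best_single` is ours.

```python
# Coverage per domain from Table 1: (FF, SA Lazy FF, MH, MADLA).
TABLE_1 = {
    "blocksworld":   (31.2, 32.8, 15.0, 34.2),
    "depot":         (10.6, 9.2, 7.0, 12.4),
    "driverlog":     (17.3, 14.0, 13.0, 15.8),
    "elevators08":   (17.3, 28.1, 2.0, 29.7),
    "logistics00":   (19.9, 20.0, 3.0, 20.0),
    "openstacks":    (17.2, 17.1, 15.5, 17.5),
    "rovers":        (20.0, 19.9, 6.6, 20.0),
    "satellites":    (20.0, 20.0, 6.0, 20.0),
    "woodworking08": (1.8, 5.0, 5.6, 4.5),
    "zenotravel":    (19.0, 14.1, 8.0, 19.0),
}
# Totals as reported in the last row of Table 1.
TOTAL_MH, TOTAL_MADLA = 81.7, 192.9


def below_best_single(table):
    """Domains where MADLA solves fewer problems than the better single heuristic."""
    return [domain for domain, (ff, sa, _mh, madla) in table.items()
            if madla < max(ff, sa)]


print(below_best_single(TABLE_1))          # ['driverlog', 'woodworking08']
print(round(TOTAL_MADLA / TOTAL_MH, 2))    # 2.36
```

In logistics00, rovers, satellites and zenotravel MADLA ties with the better single heuristic, which is why the comparison uses a strict inequality.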

4.2. Detailed Analysis in Selected Domains

In this section, we analyze the performance of the presented MADLA search in
detail. Table 2 shows a comparison of various metrics measured using a
standard Greedy Best-First Search (GBFS) with projected FF, with distributed
SA Lazy FF, and the MADLA search using both. The first two columns are the
ratios of coverage and of the number of expanded states of GBFS with projected
FF to GBFS with distributed FF. The next two columns show the percentage of
states in MADLA search expanded using projected FF and the ratio of states
expanded using projected FF to those expanded using distributed FF. The next
three columns show the time per state (in milliseconds) that the MADLA search
spends on computing the projected and distributed heuristics, and the
distributed/projected ratio. The last two columns show the average percentage
of public and privately-dependent (PD) actions in the domain. We will not
introduce the concept of PD actions formally, but intuitively they can be seen
as actions which are public but have some private preconditions; that is,
there is some dependence or causality hidden from agents other than the owner
of the action.

               projFF/salFF   MADLA exp              MADLA th/s                action ratio
domain         cvg    exp     projFF/all  projFF/    projFF  salFF   salFF/    public/all  PD/all
                              (%)         salFF      (ms)    (ms)    projFF    (%)         (%)
blocksworld    0.95   3.9     82          4.6        0.31    0.77    2.5       100         100
depot          1.15   0.8     92          11.5       0.65    4.76    7.3       95.7        23
driverlog      1.24   2.1     72          2.6        0.62    2.17    3.5       91.9        26.7
elevators08    0.62   42.6    81          4.3        0.36    0.57    1.6       66          66
logistics00    1.00   29.7    89          8.1        0.16    0.63    3.9       67.4        33.7
openstacks     1.01   1.6     66          1.9        3.04    11.12   3.7       100         0
rovers         1.01   3.1     71          2.4        0.23    0.81    3.5       26.1        11.5
satellites     1.00   0.7     68          2.1        0.22    0.63    2.9       7.2         2.7
woodwork.      0.36   11.9    65          1.9        1.07    2.51    2.3       99.9        13
zenotravel     1.35   0.2     69          2.2        0.53    0.58    1.1       20.7        14.1

Table 2: Comparison of various metrics (coverage, expanded states, heuristic
computation time, public and privately-dependent action ratios) using
projected FF (projFF) and SA Lazy FF (salFF).
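
The intuition behind privately-dependent actions can be made concrete with a small sketch. It is not a formal definition from this work: we assume a simplified MA-STRIPS-like view in which an action is private when all facts it touches are private, public otherwise, and PD when it is public and has at least one private precondition. The `Action` type, the fact names and the toy elevator actions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset   # precondition facts
    eff: frozenset   # effect facts


def classify(action, private_facts):
    """Return 'private', 'PD' (public with a private precondition) or 'public'."""
    if (action.pre | action.eff) <= private_facts:
        return "private"
    if action.pre & private_facts:
        return "PD"
    return "public"


def action_ratios(actions, private_facts):
    """Percentages of public (including PD) and of PD actions among all actions."""
    kinds = [classify(a, private_facts) for a in actions]
    public = sum(kind != "private" for kind in kinds)
    pd = kinds.count("PD")
    return 100 * public / len(kinds), 100 * pd / len(kinds)


# Toy elevator: the lift position and capacity are private to the elevator agent.
private = frozenset({"lift-at-2", "lift-at-3", "has-capacity"})
actions = [
    Action("move-2-3", frozenset({"lift-at-2"}), frozenset({"lift-at-3"})),
    Action("board-3", frozenset({"lift-at-3", "has-capacity", "passenger-at-3"}),
           frozenset({"passenger-in-lift"})),
    Action("call-3", frozenset({"passenger-at-3"}), frozenset({"passenger-waiting"})),
]
print([classify(a, private) for a in actions])   # ['private', 'PD', 'public']
```

Under these assumptions, the two action ratio columns of Table 2 correspond to the two values returned by `action_ratios`, averaged over the problems of a domain.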


A more detailed comparison of the projected and distributed FF heuristics in
GBFS is presented in Figure 3. On the left are the heuristic values for the
initial state of all problems in selected domains for which the value was
computed. It is clear that for most of the domains, as the complexity of the
problem grows, the difference between the distributed and projected heuristic
grows as well (note that this does not say anything about the heuristic
quality). On the right is the number of expanded states (restricted to
problems which were solved by at least one of the heuristics).

Together, these two plots show some interesting properties. First, the
elevators08 domain is an example of a domain where the distributed heuristic
gives much larger heuristic estimates, which also seem to be significantly
more informed, as suggested by the number of expanded states. As the heuristic
difference grows, the difference in the number of expanded states also grows
in favor of the distributed heuristic. Similar behavior, only not as
prominent, can be observed in the blocksworld domain. The depot domain paints
a completely different picture: the distributed heuristic also gives
significantly larger estimates, but as shown in the expanded states plot, the
heuristic guidance degrades and, for larger problems, the projected heuristic
is better for the search. The driverlog domain also fits into this category,
where the larger distributed heuristic estimates do not necessarily guide the
search better. On the other hand, in the woodworking08 domain, we can observe
that, although the heuristic estimates are nearly the same for both
heuristics, the number of states expanded by the projected heuristic grows in
comparison with the distributed FF, which suggests that even slight
differences in the heuristic may have a significant impact on the heuristic
quality and its ability to guide the GBFS.

Now, we analyze the results shown in Table 2 for each of the domains in
detail.

blocksworld  This domain is the same as the classical blocksworld domain
except for having multiple hands as agents, with the holding and free facts
being private. Each agent can solve the problem on its own, but for the
projected heuristic it seems that the solution by other agents is cheaper.
This is also suggested by the fact that all actions in the domain are
privately-dependent (PD), which means there is some causality not known to the
projected heuristic. This results in better heuristic guidance by the
distributed heuristic, with best-first search expanding almost 4x more states
with the projected heuristic than with the distributed one (column exp in
Table 2). In MADLA search, about 20% of states are expanded using the
distributed heuristic (column MADLA exp left), which seems to be enough to
utilize the better heuristic guidance, while the 80% of states expanded
locally benefit from the 2.5x speedup per state (columns MADLA exp left and
MADLA th/s right), thus solving even more problems than GBFS with distributed
FF.

depot  In depot, the trucks, depots, and distributors are agents. Most of the
actions are public and nearly a quarter of them have private preconditions (PD
actions), but those are actions which operate with non-shared locations, that
is, typically initial states, which diminishes the effect of PD actions. Only
the drive-truck actions are completely private. The distributed heuristic
takes approx. 7x longer to evaluate on average (column MADLA th/s right in
Table 2), but on its own, its heuristic guidance is rather poor (GBFS with the
distributed heuristic expands over 20% more states, column exp). This leads to
worse performance of the search guided only by the distributed FF heuristic.
In MADLA, due to the time-demanding distributed heuristic computation, about
92% of states are expanded using the projected heuristic (column MADLA exp
left). Nevertheless, the small number of states expanded using the distributed
heuristic improves the result even in comparison with GBFS using only the
projected FF.

driverlog  In driverlog, the drivers are agents; their locations, the walk
action and the fact that a driver is driving a truck are private, everything
else is public. Most of the problems can be solved by a single agent. The
distributed heuristic seems to guide the search slightly better (approx. 2x
fewer expanded states, column exp in Table 2), but takes approx. 3.5x more
time per state, which results in significantly worse coverage (column MADLA
th/s right, column cvg and Table 1). In MADLA, this is partially improved by
approx. 70% of states being expanded using the projection (column MADLA exp
left), but it is not enough to reach the coverage score of the projected
heuristic on its own.

elevators08  In the elevators domain, the elevators are agents, and locations
of passengers are public only if shared among multiple elevators (changing
floors). Public actions are only those board and leave actions involving a
shared floor. Most of the problems can be solved by a subset of agents
(typically the slow elevators). All public actions have private preconditions
on the state of the lift, its capacity, etc. (they are privately-dependent),
which makes the distributed heuristic dramatically more accurate. The GBFS
with the projected heuristic expands over 40x more states than with the
distributed one (column exp in Table 2) and the distributed heuristic takes on
average only approx. 1.5x longer to compute (column MADLA th/s right). This
results in a difference of over 10 problems in coverage in favor of the
distributed heuristic (Table 1). In MADLA, about 20% of states are expanded
using the global heuristic (column MADLA exp left), which is enough not only
to match the performance of GBFS with the distributed heuristic, but to
improve it slightly.

Figure 3: Heuristic values for initial states and number of expanded states on
selected domains.

woodworking08  In the woodworking domain, each tool is an agent. All facts and
actions in this domain are public, except for the fact stating that a
high-speed saw is empty or loaded. Consequently, loading and unloading the
high-speed saw are the only public actions with private preconditions
(privately-dependent). In GBFS, the distributed heuristic expands approx. 12x
fewer states and takes 2.3x longer to evaluate per state (columns exp and
MADLA th/s right in Table 2), which results in solving more than twice as many
problems using the distributed heuristic. In MADLA, however, the result is
slightly worse. This is because the problem itself is hard and even the
projected heuristic takes substantial time to compute. As more than 60% of
states are expanded using the projected heuristic (column MADLA exp left),
which does not provide as good heuristic guidance, the whole process is
hindered and fewer problems are solved. Note that, as the experiments are
evaluated statistically, the difference of 0.5 problems is not significant and
may as well be a result of the non-determinism in the computation.

Concerning the other domains, zenotravel (agents are planes transporting
passengers) is very similar to driverlog, except that the heuristic guidance
of the distributed heuristic is significantly worse than that of the projected
heuristic, while the distributed heuristic takes nearly the same amount of
time to compute, which in MADLA eliminates its negative effect. Based on the
measured values and also on the understanding of the domain, the logistics00
domain (trucks and planes as agents transporting packages) is similar to the
elevators08 domain, although much easier to solve. The rovers and satellites
domains are very loosely coupled domains with only a small portion of public
actions, and are also easy to solve. On the other hand, in openstacks all the
information is public (the agents are a manager and a manufacturer), resulting
in little difference between the projected and distributed heuristic (except
for the time necessary to evaluate a state).
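
The two "MADLA exp" columns of Table 2 referred to throughout the domain analysis are linked by simple arithmetic: if p percent of the states are expanded using projected FF, the projected-to-distributed ratio is p / (100 - p). The sketch below recomputes the ratio from the (rounded) percentages in the table; the dictionary and function names are ours.

```python
# Table 2, group "MADLA exp": (projFF/all in %, reported projFF/salFF ratio).
MADLA_EXP = {
    "blocksworld": (82, 4.6),
    "depot":       (92, 11.5),
    "driverlog":   (72, 2.6),
    "elevators08": (81, 4.3),
    "logistics00": (89, 8.1),
    "openstacks":  (66, 1.9),
    "rovers":      (71, 2.4),
    "satellites":  (68, 2.1),
    "woodwork.":   (65, 1.9),
    "zenotravel":  (69, 2.2),
}


def projected_to_distributed(share_percent):
    """Ratio of states expanded via projected FF to those via distributed FF."""
    return share_percent / (100 - share_percent)


for domain, (share, reported) in MADLA_EXP.items():
    recomputed = round(projected_to_distributed(share), 1)
    print(f"{domain:12s} {share:3d}%  reported {reported:5.1f}  recomputed {recomputed:5.1f}")
```

With the rounded percentages from the table, the recomputed ratios agree with the reported ones to one decimal place, so the two columns carry the same information in two forms.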

In summary, the distributed heuristic is useful in domains with a high number
of privately-dependent actions, which are a sign of necessary interaction
among the agents, not visible to the projected heuristic, or with some crucial
information being private (as in woodworking). In contrast, the projected
heuristic performs better on domains where the agents are interchangeable
(driverlog, zenotravel), or where the distributed heuristic misguides the
search (depot). In most cases, the MADLA search is able to let the better
heuristic dominate the search, and thus on most domains MADLA dominates the
single-heuristic approaches. Notice that on some domains MADLA even improves
the coverage over the distributed heuristic (namely blocksworld and
elevators). Only in significantly hard problems, where the distributed
heuristic takes long to compute and does not give better estimates (depot), or
where the projected heuristic takes long to compute and misguides the search
(woodworking), does the MADLA search not dominate the GBFS. It is also
important to note that in woodworking, both the projected and the distributed
FF heuristics ignore a significant number of dead-ends (as described in
Section 3.1.3), thus slowing the search and solving fewer problems.

Figure 4: Subset of related work to the MADLA planner. The solid arrows
represent comparison and improvement (in most cases); dashed lines represent
informal comparison not over coverage (number of solved problems) or runtime;
and dotted lines represent inspiration of the approaches. Additionally,
superscripts c, i and * stand for centralized, incomplete and optimal
algorithms respectively.



4.3. Comparison with the State of the Art

The planners for MA-STRIPS and related multiagent models in the literature
range from centralized single-core planners [18, 14], through parallel
multi-core ones [3, 19], to fully distributed implementations with
communication within one physical computer or over the network [3, 5, 12]. In
Figure 4, we summarize the planners related to our MADLA planner (i.e.,
compatible with the MA-STRIPS model), visualizing which planners were compared
with (and thus usually outperformed) which other planners.

It is not surprising that the first MA-STRIPS planner, called Planning First
[20], was based on the principles used in the MA-STRIPS paper [2]. Planning
First coordinates local forward-search planners by a distributed solver of a
Constraint Satisfaction Problem (DisCSP). Planning First was the first
representative of coordination-centric planners and outperformed JavaFF [21],
a Java implementation of the FF planner. Besides the coordination-centric
Distributed Planning by Graph Merging (DPGM) [8, 22], Best Response Planning
[23], and μ-SATPLAN [24], the planner Distoplan [25] and the subsequent A#
planner [26] pioneered the idea of optimal planning by intersection of Finite
State Machines (FSMs) representing the local plans of the agents. This idea
was extended and practically refined in a satisficing planner based on
nondeterministic FSMs representing the local agents’ plans and their merging,
Planning State Machine Merging (PSMM) [19].

Multiagent state-space search was first used in a distribution of optimal A*
search called MAD-A* [3], with the admissible projected heuristics LM-cut [27]
(landmark-based) and Merge&Shrink [28]. Multiagent state-space search was also
adopted by a specific set of multiagent planners running as centralized
sequential processes; these planners are therefore not directly comparable
with the distributed multiagent planners, as they have no communication
overheads and run on one thread. These planners are highly efficient
algorithms able to outperform state-of-the-art centralized planners such as
LAMA [16], FF [6], or LPG [29]. Multi-Agent Planning by Plan Reuse (MAPR) [14]
is an incomplete approach based on the best-response principle, sequentially
using a plan repairing algorithm (LPG-adapt [30] in particular), and its
variation called Plan Merging by Reuse (PMR) [31] proposes a distribution
scheme for MAPR. The agent decomposition-based planner (ADP) [18] uses Relaxed
Planning Graphs for repeated extraction of subgoals of a multiagent planning
problem and the FF planner to sequentially solve the extracted goals. A recent
distributed multiagent state-space search planner inspired by MAD-A* is the
Greedy Privacy Preserving Planner (GPPP) [15]; it uses landmarks refined for
particular agents and is based on an iterative deepening backtrack search in
relaxed planning subproblems. These planners are representatives of
agent-centric planners, as the search in each agent drives the process of
planning. The proposed MADLA planner fits this category.

Multiagent partial-order planning was first studied in the form of the
multiagent temporal planner TFPOP [32], which required an extended model in
contrast to MA-STRIPS because of the additional requirement on temporal
constraints. A series of partial-order planners based on a model compatible
with MA-STRIPS began with the incomplete MAP-POP [33], followed by two
versions of the practically improved planner FMAP [34, 12]. FMAP uses a
parallelized local forward search to successively complete partial plans of
the agents. The coordination subproblem is driven by a distributed form of
heuristic utilizing Domain Transition Graphs (DTGs) for approximation of
relaxed plans [9]. The computation of the heuristic can be parallelized as
well. FMAP can be understood as both coordination-centric and agent-centric,
as the planning process follows the former and the heuristic the latter
paradigm.

In Table 3, we show a comparison of the numbers of problems solved by MADLA
and by four complete, distributed multiagent planners. The results show that
MADLA loses considerably in woodworking08 and openstacks against all planners
supporting action costs. As mentioned in the previous section, these domains
contain a substantial number of dead-ends, to which the FF heuristic
(especially in the projected form) is oblivious.

domain               rdFF    GPPP    PSMM    FMAP    MADLA
blocksworld (35)     6.8     3       25      19      34.2
depot (20)           6.2     8       0       6       12.4
driverlog (20)       14      9       13      15      15.8
elevators08 (30)     2.9     16†     4       30      29.7
logistics00 (20)     5.8     20      9       10      20
openstacks (30)      11.7    0*      30      23      17.5
rovers (20)          14.7    10      14      19      20
satellites (20)      10.8    16      8       16      20
woodworking08 (30)   5.6     0*      25      22      4.5
zenotravel (20)      6.1     20      17      18      19
total (245)          84.6    102     145     178     192.9

Table 3: Comparison of MADLA and state-of-the-art planners. † Version of the
domain used in the GPPP experiments, without action costs and consisting of 16
problems. * GPPP does not support action costs.

Although the result table does not contain the PMR planner, MADLA outperforms
it on the presented benchmark set as well, simply because PMR is an incomplete
planner, as stated in [14, 31]. PMR solves only problems where each goal fact
is solvable by a single agent. Thus, it does not solve problems of the depot,
logistics00, openstacks, and woodworking08 domains. Even if PMR solved all
problems of all other domains, MADLA would outperform it by 35%.

Against rdFF [5], our recent multiagent heuristic search using a different
distribution scheme and implementation of FF, MADLA shows more than 2x
improvement consistently over all domains with the exception of woodworking08.
Similarly, MADLA outperforms GPPP nearly 2x over all domains and PSMM by 33%.
Finally, MADLA solves nearly 15 more problems of the benchmark set than the
currently top performing multiagent planner FMAP, which corresponds to an 8%
improvement.
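
The relative improvements quoted above follow from the total row of Table 3. The sketch below recomputes them from the reported totals; the function name `improvement_percent` is ours.

```python
# Totals over all 245 problems, as reported in the last row of Table 3.
TOTALS = {"rdFF": 84.6, "GPPP": 102, "PSMM": 145, "FMAP": 178, "MADLA": 192.9}


def improvement_percent(other):
    """Relative coverage gain of MADLA over another planner, in percent."""
    return 100 * (TOTALS["MADLA"] / TOTALS[other] - 1)


for planner in ("rdFF", "GPPP", "PSMM", "FMAP"):
    ratio = TOTALS["MADLA"] / TOTALS[planner]
    print(f"{planner:5s} ratio {ratio:.2f}  gain {improvement_percent(planner):.1f}%")

print(round(TOTALS["MADLA"] - TOTALS["FMAP"], 1))   # 14.9 more problems than FMAP
```

These figures compare totals only; the per-domain rows of Table 3 show that the ranking differs between domains.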

5.  Conclusion 

As in classical planning, heuristic state-space search is a viable technique
for multiagent planning as well. However, in contrast to classical heuristic
search, the multiagent setup raises its own challenges. The dilemma of highly
informed but slow versus less informed but fast heuristic estimators manifests
itself in the dichotomy of a projected heuristic, restricted to the agent’s
local view of the problem, versus a distributed heuristic estimating the
global heuristic value at the cost of significant communication or
computational overhead. The MADLA planner combines a fast local projection of
the FF heuristic with a global distributed FF heuristic in an attempt to
combine their benefits and mitigate their negative effects.

The technique used in MADLA to combine the heuristic estimators is based on
the classical multi-heuristic search, but it does not evaluate all states by
both heuristics. Instead, the local heuristic is used only while the
distributed heuristic is idle, that is, waiting for replies from other agents.
The projected heuristic is used to fully utilize the computational resources
of the agent, even if for a less informed but faster search of the state space
of the individual agent. When the estimation by the more informed heuristic is
finished, the evaluated state is used in both searches. This principle was
theoretically analyzed in a general state-space search framework and
practically evaluated in the form of an MA-STRIPS based planner, which
outperforms all current multiagent planners in the coverage metric over a
common benchmark set.
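
The scheduling idea summarized above can be illustrated by a toy, single-process simulation. It is a sketch under our own simplifying assumptions and not the MADLA algorithm itself: waiting for replies from other agents is modeled by a fixed `latency` counted in expansion steps, only one distributed estimate is in flight at a time, and all names (`madla_like_search`, `h_local`, `h_dist`) are hypothetical.

```python
import heapq
from collections import deque
from itertools import count


def madla_like_search(start, is_goal, successors, h_local, h_dist, latency=3):
    """Expand from the distributed OPEN list when it has states, else locally."""
    tie = count()                        # tie-breaker, states are never compared
    open_local = [(h_local(start), next(tie), start)]
    open_dist = []                       # states with a finished distributed estimate
    requests = deque([start])            # states waiting for a distributed estimate
    in_flight, ticks_left = None, 0      # estimate currently being computed remotely
    closed = set()
    expanded = {"dist": 0, "local": 0}

    while open_local or open_dist or requests or in_flight is not None:
        # A finished distributed estimate makes the state available in both
        # lists (it already sits in the local one).
        if in_flight is not None and ticks_left <= 0:
            heapq.heappush(open_dist, (h_dist(in_flight), next(tie), in_flight))
            in_flight = None
        # Start the next distributed estimate, skipping states already expanded.
        while in_flight is None and requests:
            candidate = requests.popleft()
            if candidate not in closed:
                in_flight, ticks_left = candidate, latency
        # The distributed list is prioritized; the local list fills spare time.
        if open_dist:
            _, _, state = heapq.heappop(open_dist)
            source = "dist"
        elif open_local:
            _, _, state = heapq.heappop(open_local)
            source = "local"
        else:
            state = None                 # nothing to expand, just wait
        if in_flight is not None:
            ticks_left -= 1              # one step of waiting has passed
        if state is None or state in closed:
            continue
        closed.add(state)
        expanded[source] += 1
        if is_goal(state):
            return state, expanded
        for nxt in successors(state):
            if nxt not in closed:
                heapq.heappush(open_local, (h_local(nxt), next(tie), nxt))
                requests.append(nxt)
    return None, expanded


# Toy chain 0 -> 1 -> ... -> 5 with an uninformed local heuristic.
goal_state, stats = madla_like_search(
    start=0,
    is_goal=lambda s: s == 5,
    successors=lambda s: [s + 1] if s < 5 else [],
    h_local=lambda s: 0,
    h_dist=lambda s: 5 - s,
)
print(goal_state, stats)
```

In this sketch a state that was already expanded locally is simply skipped when its distributed estimate arrives; how such states are handled in the actual planner is outside what this illustration claims.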

Two research directions are left for future work. First, whether the principle
of MADLA search can be used for optimal multiagent planning with similarly
promising results. Second, as the results show, the planner does not perform
well on domains with dead-ends, to which relaxation heuristics are oblivious.
We can ask what efficiency boost could be achieved by combining not only the
projected and distributed estimators of one heuristic, but also other
heuristics, possibly orthogonal to the first heuristic pair; landmarks are an
obvious choice.

More generally, combining fast but less accurate and slow but more informed
heuristics in an asynchronous manner may also be an interesting research
direction in classical planning (esp. on multicore machines) and in search in
general.
Abstract

In the context of modeling and reasoning about agent actions, contingent and
classical planning can often be respectively seen as adopting “extreme
pessimism” and “extreme optimism” about the action outcomes. For many everyday
scenarios of human reasoning (and thus for many types of autonomous systems),
both these approaches are just too extreme. Following Jensen, Veloso, and
Bryant (2004), we examine a planning model that interpolates between classical
and contingent planning via tolerance to arbitrary k faults occurring during
plan execution. We show that an important fragment of this fault tolerant
planning (FT-planning) exhibits both an appealing solution structure, as well
as appealing worst-case time-complexity properties. We also show that such
FT-planning tasks can be efficiently compiled into classical planning as long
as the number of possible faults per operator is bounded by a constant, and we
show that this compilation can be attractive in practice.

Introduction 

To date, contingent and classical planning appear to be the two major
approaches to non-probabilistic planning under full observability. In
contingent planning, at least some aspects of system dynamics are modeled by
operators with non-deterministic effects, and a plan should guarantee reaching
a goal state under any realization of the actions it prescribes. In classical
planning, the operators are all set to be deterministic, modeling only the
singular intended effects of each action. While contingent plans provide much
stronger guarantees on reaching the goal with respect to the true physics of
the modeled system, they are also much harder to generate (both worst-case and
empirically), and quite often they may simply not exist.

In the physical world, no actions are really guaranteed to succeed. However,
non-determinism in real-world domains is often caused by infrequent errors
that make otherwise deterministic operators fail. Hence, many unsolvable
contingent planning tasks become solvable if we assume that no more than some
n exceptional/faulty action effects will occur along the purported plan to the
goal. In the past, this observation led numerous researchers to consider
explicit representation of and reasoning about faults of agents’ actions
(Georgeff and Lansky 1986; Williams et al. 2003; Giunchiglia, Spalazzi, and
Traverso 1994). In particular, Jensen, Veloso, and Bryant (2004) suggested a
model of fault tolerant planning (FT-planning) and developed the first
algorithms for generating plans that are robust to a single fault occurring
during plan execution. This model is our focus here.

Departing from contingent planning and generalizing the FT-planning model of
Jensen, Veloso, and Bryant (2004), we show that, while FT-planning remains in
general as computationally hard as contingent planning, one of its practically
most valuable fragments, namely the one considered by Jensen, Veloso, and
Bryant (2004), (1) is in PSPACE, (2) falls into NP when restricted to plans
with only polynomial-length executions, and (3) is guaranteed to admit
stationary solutions for solvable problems, solutions that sometimes induce
(possibly cyclic) strong contingent plans. Furthermore, we show that these
FT-planning tasks can be efficiently compiled into equivalent classical
planning tasks in a way that is sound, complete, and practicable.

Our  results  join  a  growing  body  of  work  on  planning  un¬ 
der  uncertainty  and/or  partial  observability  via  compilation 
to  classical  planning  (Palacios  and  Geffner  2009;  Albore, 
Palacios,  and  Geffner  2009;  Bonet  and  Geffner  2011;  Braf- 
man  and  Shani  2012a;  Taig  and  Brafman  2013).  At  a  high 
level,  FT-planning  is  an  instance  of  “assumption-based  plan¬ 
ning,”  and  the  latter  term  has  already  been  used  for  a  broad 
range  of  ideas  and  techniques  (Albore  and  Bertoli  2004; 
2006;  Albore  and  Geffner  2009;  Bonet  and  Geffner  2011; 
Göbelbecker, Gretton, and Dearden 2011; Davis-Mendelow, Baier, and
McIlraith 2012). Closest in spirit to our work here, though in two
different ways, are probably the works of Albore and Bertoli (2006)
and Davis-Mendelow, Baier, and McIlraith (2012). Albore and Bertoli
suggested an interesting planning approach in which assumptions
about operator effects are provided a priori as a linear temporal
logic formula, and the planner takes these assumptions as axioms. In
the worst case, however, this approach remains as hard as contingent
planning. Davis-Mendelow et al. exploit assumption-based assertions
about the initial state to suggest a middle ground between classical
planning and conformant, or zero observability, planning. The
latter, however, is very different from contingent planning, both
conceptually and complexity-wise (Bonet 2010).


Planning  Formalisms  and  Solution  Concepts 

Non-deterministic  planning  tasks  with  full  observability 
correspond  to  succinctly  represented,  goal-oriented  non- 
deterministic  Markov  decision  processes  (Puterman  1994). 
Several  languages  for  succinctly  representing  such  tasks 
are  in  use  (Hoffmann  and  Brafman  2005;  Bonet  2010; 
Davis-Mendelow, Baier, and McIlraith 2012). To simplify
presentation,  here  we  adopt  a  minimalistic  extension  of 
STRIPS  (Fikes  and  Nilsson  1971)  to  non-deterministic  op¬ 
erator  effects. 

A planning task is given by a quadruple $\Pi = (P, O, s_0, G)$.
$P$ is a set of $n$ propositions, with world states $S$ being
represented by complete valuations of $P$, and usually discussed as
sets of propositions that hold true in them. $s_0 \in S$ is an
initial state,$^1$ and $G$ is a subset of $P$: a state $s$ is a goal
state iff $G \subseteq s$, and the set of all goal states is denoted
by $S_G$. $O$ is a set of operators
$o = (\mathrm{pre}(o), \mathrm{eff}(o))$ where the precondition
$\mathrm{pre}(o)$ is a subset of propositions $P$, and
$\mathrm{eff}(o) = \{e_1, \dots, e_{n_o}\}$ is a set of possible
effects of $o$. Each possible effect $e \in \mathrm{eff}(o)$ is
given by a pair $(\mathrm{add}(e), \mathrm{del}(e))$ of subsets of
$P$, corresponding to its add and delete lists, respectively. An
operator $o$ is applicable in state $s$ iff
$\mathrm{pre}(o) \subseteq s$, and the set of all such operators is
denoted by $O(s)$. If $o \in O(s)$ is applied in $s$, it changes the
world to one of the states
$\mathrm{Res}[s; o] = \bigcup_{e \in \mathrm{eff}(o)} \{\mathrm{Res}[s; e]\}$,
where $\mathrm{Res}[s; e] = (s \setminus \mathrm{del}(e)) \cup \mathrm{add}(e)$
is the state resulting from the effect $e$ occurring in $s$. A
(contingent) plan for a task $\Pi$ is an action strategy that
guarantees reaching a goal state $s \in S_G$ from $s_0$, under any
realization of the operators applied along the way; the process of
search for contingent plans is called contingent planning.
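
The definitions above can be illustrated with a minimal Python sketch. The class and function names (`Effect`, `Operator`, `res_effect`, `res_operator`) and the sample `move` operator are illustrative assumptions, not part of the original formalism; states are modeled as frozen sets of the propositions that hold true.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

State = FrozenSet[str]  # a state is the set of propositions that hold in it


class Effect(NamedTuple):
    add: FrozenSet[str]     # add list of the effect
    delete: FrozenSet[str]  # delete list of the effect


class Operator(NamedTuple):
    pre: FrozenSet[str]          # precondition, a subset of P
    effects: Tuple[Effect, ...]  # the possible (non-deterministic) effects


def applicable(o: Operator, s: State) -> bool:
    """o is applicable in s iff pre(o) is a subset of s."""
    return o.pre <= s


def res_effect(s: State, e: Effect) -> State:
    """Res[s; e] = (s minus del(e)) union add(e)."""
    return (s - e.delete) | e.add


def res_operator(s: State, o: Operator) -> Set[State]:
    """Res[s; o]: the set of states that applying o in s may lead to."""
    if not applicable(o, s):
        raise ValueError("operator is not applicable in this state")
    return {res_effect(s, e) for e in o.effects}


# Hypothetical operator with two possible outcomes (an assumption for
# illustration): either 'x' is deleted, or 'noflat' is deleted.
move = Operator(
    pre=frozenset({"x", "noflat"}),
    effects=(
        Effect(add=frozenset(), delete=frozenset({"x"})),
        Effect(add=frozenset(), delete=frozenset({"noflat"})),
    ),
)
successors = res_operator(frozenset({"x", "noflat"}), move)
```

A contingent plan has to cope with every member of `successors`, which is what makes plan existence so much harder to decide than in the deterministic case.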

While  nondeterministic  operators  must  be  dealt  with  in 
one  form  or  another  in  many  planning  applications,  two 
problems  particular  to  contingent  planning  must  be  taken 
into account. First, deciding whether a contingent plan exists is
EXPTIME-complete (Rintanen 2004).$^2$ Second, many tasks admit no
contingent plans, and this is true even for simple tasks that humans
feel comfortable dealing with (Cimatti et al. 2003; Pistore and
Vardi 2007). In that respect, a pragmatic alternative to contingent
planning is classical planning, in which operators are
deterministic. By adopting classical
planning  as  an  abstraction  of  contingent  planning,  we  as¬ 
sume  that  we  know  precisely  what  will  happen  when  an  op¬ 
erator  o  is  applied  in  state  s.  This  assumption  is  then  “en¬ 
coded”  at  the  level  of  individual  operators  by  what  is  called 
determinization,  reducing  the  set  of  possible  effects  of  each 
operator  to  exactly  one  effect. 

While  being  much  more  restricted,  classical  planning 
resolves  to  a  large  extent  the  two  aforementioned  short¬ 
comings  of  contingent  planning.  First,  restricting  each 
state/action  pair  to  a  sole  possible  successor  often  renders 
unsolvable  problems  solvable.  Second,  classical  planning 
is  in  PSPACE  (Bylander  1994),  and  more  importantly,  it  is 

1 We assume here that there is no uncertainty about the initial
state, and later discuss the impact of this assumption.

2 EXPTIME-completeness still holds even for testing the existence of
plans that reach the goal with probability exceeding p for
probabilistic problems with full observability (Littman 1997).


in NP if restricted to polynomial-length plans. Last but not least,
classical plans are structurally simple, constituting linear
sequences of operators. Together with NP-membership, this structural
simplicity allows for exploiting various OR-graph search techniques
for developing empirically efficient solvers for classical planning
(Hoffmann and Nebel 2001; Helmert 2006; Rintanen, Heljanko, and
Niemelä 2006; Kissmann and Edelkamp 2012). As a result, combining
classical planning with online re-planning in unexpected situations
is a popular and effective approach to closed-loop control of
autonomous systems (Yoon et al. 2008; Talamadupula et al. 2010;
Domshlak et al. 2011; Bonet and Geffner 2011; Brafman and Shani
2012b).

Fault  Tolerant  (Contingent)  Planning 

Given  the  relative  pros  and  cons  of  contingent  and  classical 
planning,  the  first  question  one  might  ask  is:  If  these  are 
the  two  extremes,  how  can  we  interpolate  between  them  in  a 
simple and useful manner? This question brought us to consider
fault tolerant planning: planning under the assumption
that  no  more  than  some  k  unintended  effects  of  the  operators 
will  occur  along  the  purported  plan  to  the  goal,  but  at  the 
same  time,  under  a  requirement  for  the  plans  to  be  provably 
robust  for  up  to  k  such  operator  faults  during  plan  execution. 
Fault  tolerant  planning  was  originally  introduced  by  Jensen, 
Veloso,  and  Bryant  (2004)  in  order  to  bring  some  key  infor¬ 
mation  from  probabilistic  uncertainty  models  to  qualitative 
non-deterministic  planning.  The  basic  idea  is  to  associate 
the  contingent  planning  task  at  hand  with  an  explicit  dis¬ 
tinction  between  the  primary  and  exceptional  effects  of  its 
operators. The model we adopt for that purpose is simply a function
$\mathcal{F}$ that maps each possible effect of each operator to the
“number of exceptions,” or unintended artifacts, associated with
this effect. The operator effects $e$ for which $\mathcal{F}(e) = 0$
correspond to the primary effects of the respective operators.

Definition 1 Let $\Pi = (P, O, s_0, G)$ be a contingent planning
task. An exception model for $\Pi$ is a function
$\mathcal{F} : \bigcup_{o \in O} \mathrm{eff}(o) \to \mathbb{N}$,
computable in time polynomial in $\|\Pi\|$. If for each operator
$o \in O$, $|\{e \mid e \in \mathrm{eff}(o), \mathcal{F}(e) = 0\}| \le a$,
then $\mathcal{F}$ is called $a$-primary. Likewise, if for each
operator $o \in O$,
$|\{e \mid e \in \mathrm{eff}(o), \mathcal{F}(e) = 0\}| > 0$, then
$\mathcal{F}$ is called normative.
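
As a minimal sketch of Definition 1, the two properties can be checked directly once an exception model is written down as a table. The representation below (a dictionary keyed by a hypothetical operator name and effect index) and the function names are assumptions made for illustration only.

```python
from typing import Dict, Tuple

# Exception model: maps (operator name, effect index) to a number of exceptions.
ExceptionModel = Dict[Tuple[str, int], int]


def primary_count(op: str, n_effects: int, model: ExceptionModel) -> int:
    """Number of effects of `op` that the model maps to 0 (primary effects)."""
    return sum(1 for i in range(n_effects) if model[(op, i)] == 0)


def is_a_primary(effect_counts: Dict[str, int], model: ExceptionModel, a: int) -> bool:
    """True iff every operator has at most `a` primary effects."""
    return all(primary_count(op, n, model) <= a for op, n in effect_counts.items())


def is_normative(effect_counts: Dict[str, int], model: ExceptionModel) -> bool:
    """True iff every operator has at least one primary effect."""
    return all(primary_count(op, n, model) > 0 for op, n in effect_counts.items())


# Hypothetical task: 'move' has two effects, 'fix' has one.
effect_counts = {"move": 2, "fix": 1}
model = {("move", 0): 0, ("move", 1): 1, ("fix", 0): 0}

# A model in which every effect of the single operator is exceptional:
# still 1-primary (zero primary effects), but not normative.
all_faulty_counts = {"o": 2}
all_faulty_model = {("o", 0): 1, ("o", 1): 1}
```

Note that the two properties are independent: the second model satisfies the upper bound of being 1-primary while failing the lower bound required for normativity.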

In simple terms, an exception model is a-primary if at most a
effects of each operator are considered to be its primary effects,
and it is normative if each operator is associated with at least one
primary effect. In these terms, the work of Jensen, Veloso, and
Bryant (2004) has been devoted to fault tolerant planning under
1-primary normative exception models, which seem to cover well
operator non-determinism that stems from physical complications of
executing agent actions in the real world.$^3$ Associating
non-deterministic operators with exception models allows for a
3 If, however, some operators model knowledge acquisition, i.e.,
sensing, then (part of the) operator non-determinism will be due to
primary operator effects, and thus planning with a-primary models
for a > 1 is not of theoretical interest only.


simple relaxation of contingent planning to planning under fault
tolerance requirements as above. Let $\Pi$ be a contingent planning
task, $\mathcal{F}$ be an exception model for $\Pi$, and $\pi$ be an
action policy for $\Pi$. Overloading the notation, for an execution
$\rho = (s_0, e_0, \dots, s_i, e_i, \dots)$ of $\pi$, we define

$$\mathcal{F}(\rho) = \sum_{i=0}^{\infty} \mathcal{F}(e_i).$$

- An execution $\rho$ of $\pi$ is called $k$-admissible if
  $\mathcal{F}(\rho) \le k$.

- Action policy $\pi$ is a $k$-plan for $\Pi$ if each of its
  $k$-admissible executions is finite and reaches the goal.

In what follows, we refer to triplets $(\Pi, \mathcal{F}, k)$ as
above as fault tolerant (FT) planning tasks, and solutions for such
tasks are precisely $k$-plans for $\Pi$ under $\mathcal{F}$.

In general, a $k$-plan $\pi$ can be either a stationary (possibly
partial) policy $\pi : S \to O$, or a non-stationary policy
$\pi : S \times \mathbb{N} \to O$ that depends on the current state
and the number of “failures so far.” As noted by Jensen et
al. (2004), the latter can be captured as a stationary policy for a
certain contingent planning task $\Pi^{(\mathcal{F},k)}$ that we
refer to as the $(\mathcal{F}, k)$-reformulation of $\Pi$: Given an
FT-planning task $(\Pi = (P, O, s_0, G), \mathcal{F}, k)$,
$\Pi^{(\mathcal{F},k)}$ is a contingent planning task over states
$S^{(\mathcal{F},k)} = S \times \{0, \dots, k\}$, operators $O$,
initial state $s_0^{(\mathcal{F},k)} = (s_0, 0)$, and goal states
$S_G^{(\mathcal{F},k)} = \bigcup_{i=0}^{k} \{(s, i) \mid s \in S_G\}$.
For each $s \in S$, $o \in O(s)$, and $0 \le i \le k$, $o$ is
applicable in $(s, i)$, and if applied, it changes the world to one
of the states

$$\mathrm{Res}[(s, i); o] = \bigcup_{e \in \mathrm{eff}(o),\; k - (i + \mathcal{F}(e)) \ge 0} \{\mathrm{Res}[(s, i); e]\}, \qquad (1)$$

where $\mathrm{Res}[(s, i); e] = (\mathrm{Res}[s; e], i + \mathcal{F}(e))$.
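
The successor relation of Eq. (1) can be sketched in a few lines of Python. This is an illustrative sketch only: the names `Effect`, `Operator`, and `reformulated_successors`, the `faults` field holding the exception-model value of an effect, and the sample `move` operator are all assumptions introduced here.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

State = FrozenSet[str]
RefState = Tuple[State, int]  # (s, i): world state plus "failures so far"


class Effect(NamedTuple):
    add: FrozenSet[str]
    delete: FrozenSet[str]
    faults: int  # value of the exception model on this effect


class Operator(NamedTuple):
    pre: FrozenSet[str]
    effects: Tuple[Effect, ...]


def res(s: State, e: Effect) -> State:
    """Res[s; e] = (s minus del(e)) union add(e)."""
    return (s - e.delete) | e.add


def reformulated_successors(state: RefState, o: Operator, k: int) -> Set[RefState]:
    """Successors of (s, i) under o, keeping only effects that fit the fault budget k."""
    s, i = state
    if not o.pre <= s:
        raise ValueError("operator is not applicable in this state")
    return {
        (res(s, e), i + e.faults)
        for e in o.effects
        if k - (i + e.faults) >= 0
    }


# Hypothetical operator: the primary effect moves from a to b, the faulty one
# leaves the robot at a with a flat tire.
move = Operator(
    pre=frozenset({"at_a", "noflat"}),
    effects=(
        Effect(frozenset({"at_b"}), frozenset({"at_a"}), faults=0),
        Effect(frozenset(), frozenset({"noflat"}), faults=1),
    ),
)
s0 = frozenset({"at_a", "noflat"})
fresh = reformulated_successors((s0, 0), move, k=1)
exhausted = reformulated_successors((s0, 1), move, k=1)
```

Once the fault budget is used up (`i = k`), only the primary effect survives, so the operator behaves deterministically in that layer of the reformulation.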

It is not hard to verify that there is a bijective correspondence
between plans for $\Pi^{(\mathcal{F},k)}$ and non-stationary
$k$-plans for $(\Pi, \mathcal{F}, k)$, but the relation to
stationary $k$-plans for $(\Pi, \mathcal{F}, k)$ is less immediate.
Theorem 1 clarifies this matter. Let $\pi$ be a policy (not
necessarily a plan) for $\Pi^{(\mathcal{F},k)}$. The execution tree
$T_\pi(s, i)$ is the tree of possible executions of $\pi$ starting
at $(s, i)$, with nodes corresponding to states of
$\Pi^{(\mathcal{F},k)}$, edges corresponding to operator effects,
and the internal nodes of $T_\pi$ forming a subset of
$S^{(\mathcal{F},k)}$. (Henceforth, $T_\pi(s_0, 0)$ is referred to
for short as $T_\pi$.)

Theorem 1 Let $(\Pi, \mathcal{F}, k)$ be a solvable FT-planning
task. If $\mathcal{F}$ is normative, then there exists a stationary
$k$-plan $\pi$ for $(\Pi, \mathcal{F}, k)$. In contrast, there exist
solvable FT-planning tasks (with non-normative exception models) for
which there are no stationary $k$-plans.

The  proof  of  the  second  sub-claim  is  by  example:  Let  n 
be  a  contingent  planning  task  over  states  S  =  {so,  •  •  • ,  Sg}, 
operators  O  =  {00, . . . ,  05}  as  in  Figure  la,  initial  state  So, 
and  Sg  =  { -sx } .  Assume  an  exception  model  T  for  n  as  in 
the last column of Figure 1a. Figure 1b depicts the only contingent
plan for the reformulation $\Pi^{(\mathcal{F},1)}$, and the
respective 1-plan for $(\Pi, \mathcal{F}, 1)$ is not stationary:
different actions, $o_1$ and $o_2$, are taken at state $s_1$ with 0
and 1 “exceptions so far,” respectively, and such a
history-dependent choice of operator at $s_1$ is unavoidable.


Figure 1: Illustrations for the examples around Theorem 1.

Note that $\mathcal{F}$ in the above example is not normative
because neither of the effects of operator $o_1$ is primary. With
normative exception models, the situation is indeed different. Let
$(\Pi, \mathcal{F}, k)$ be a solvable FT-planning task with a
normative model $\mathcal{F}$, and $\pi$ be a contingent plan for
$\Pi^{(\mathcal{F},k)}$. If $\pi(s, i) = \pi(s, j)$ for all pairs of
reformulation states $(s, i), (s, j) \in S^{(\mathcal{F},k)}$, then
we are done because the $k$-plan for $(\Pi, \mathcal{F}, k)$
corresponding to $\pi$ is stationary. Otherwise, let
$(s, i), (s, j) \in S^{(\mathcal{F},k)}$, $i < j$, be a pair of
reformulation states for which $\pi(s, i) \ne \pi(s, j)$. The proof
is accomplished by showing that $\pi'$, obtained from $\pi$ by
replacing $\pi(s, j)$ with $\pi(s, i)$, is also a plan for
$\Pi^{(\mathcal{F},k)}$, and thus we can always iteratively reduce a
non-stationary $\pi$ to a stationary one.

Note that, by the construction used in the proof of Theorem 1, if
$\pi$ is a non-stationary $k$-plan for an FT-planning task
$(\Pi, \mathcal{F}, k)$ with a normative model $\mathcal{F}$, then
$\pi$ can be efficiently translated into a stationary $k$-plan
$\pi'$ for $(\Pi, \mathcal{F}, k)$. Likewise, for some of such pairs
$\pi$ and $\pi'$, $\pi'$ may turn out to be a strong cyclic
contingent plan for $\Pi$. For instance, let $\Pi$ be a contingent
planning task over states $S = \{s_0, \dots, s_4\}$, operators
$O = \{o_0, \dots, o_3\}$, initial state $s_0$, and goal state
$s_4$. The operators are defined as in the table in Figure 1c, and
the exception model $\mathcal{F}$ associated with $\Pi$ is given in
the last column of that table. Figure 1d depicts a contingent plan
$\pi$ for the reformulation $\Pi^{(\mathcal{F},1)}$, and the
respective 1-plan for $(\Pi, \mathcal{F}, 1)$ is not stationary:
different actions, $o_0$ and $o_3$, are taken at state $s_0$ with 0
and 1 “exceptions so far,” respectively. However, if we modify $\pi$
as in the proof of Theorem 1, the resulting plan $\pi'$ for
$\Pi^{(\mathcal{F},1)}$ will induce a strong cyclic contingent plan
$\pi^*(s_i) = o_i$ for $\Pi$.

Complexity  and  Compilation 

Two decision problems are of interest in the context of
FT-planning: Let $\Pi$ be a contingent planning task, and
$\mathcal{F}$ be an $a$-primary model for $\Pi$.

FT-PLAN-a-k: Does $\Pi$ have a $k$-plan?

POLY-FT-PLAN-a-k: Does $\Pi$ have a $k$-plan such that all its
$k$-admissible executions reach the goal after a polynomial number
of steps?

At first view, the effective difference between FT-PLAN-a-k and
contingent planning is not clear. In general, for sufficiently large
values of $a$ (e.g., $a = |S|$), contingent planning can trivially
be reduced to FT-PLAN-a-k for any $k$, and hence the latter decision
problem is EXPTIME-hard. In fact, FT-PLAN-a-k can be polynomially
reduced to FT-PLAN-2-k by simulating each operator with $a$ primary
effects by a “ladder” of $\log a$ operators, each with at most two
primary effects. Hence, even FT-PLAN-2-k is EXPTIME-hard. However,
while the definition of exception models is rather general, the
specific settings that brought us to this investigation correspond
to the normative 1-primary exception models considered by Jensen,
Veloso, and Bryant (2004). In what comes next, we focus on that
fragment of FT-planning.$^4$

FT-planning  with  1-primary  models 

Unlike  FT-PLAN-2-k,  FT-PLAN-1-k  nicely  generalizes  clas¬ 
sical  planning,  which  simply  corresponds  to  first  associat¬ 
ing  the  contingent  planning  task  with  a  normative  1-primary 
model,  and  then  adopting  the  extreme  optimism  that  no 
failures  will  occur  along  the  purported  plan  to  the  goal. 
In  other  words,  the  decision  version  of  classical  planning 
is  precisely  the  FT-planning  class  FT-PLAN-1-0.  At  the 
same  time,  FT-PLAN-1-1  already  goes  way  beyond  classi¬ 
cal  planning.  While  plans  for  FT-PLAN-1-1  are  restricted 
to  at  most  one  operator  failure  per  possible  plan  execu¬ 
tion,  these  failures  are  bounded  neither  to  specific  opera¬ 
tors  nor  to  specific  stages  of  the  purported  plan.  Hence, 
while  plans  for  FT-PLAN-1-0  are  linear  sequences  of  ac¬ 
tions,  plans  for  FT-PLAN-1-k  with  k  >  0  are  tree-structured, 
and may actually exhibit substantial branching: unlike in
(EXPSPACE-complete) planning under the k-branching assumption
(Bonet 2010), plans for FT-PLAN-1-k may have to always interleave
between acting and branching, even for $k = 1$.

For example, suppose that a robot should move from $x_1$ to $x_5$
on the map depicted in Figure 2(a). Movements on the segments
$(x_1, x_2)$ and $(x_1, x_3)$ are considered safe and thus are
modeled by deterministic operators. Movements
$\mathit{move}(x_i, x_j)$ on the other three segments are modeled by
non-deterministic operators with three possible effects:
$\mathit{move}(x_i, x_j)$ typically brings the robot to $x_j$, with
no side effects, but it may also bring the robot to $x_j$ with a
flat tire, or keep it at $x_i$ for the same reason. Initially the
robot has no flats, but also no spare tires. A single spare tire can
be picked up at each of the two intermediate locations $x_2$ and
$x_3$. Figures 2(b-d) depict stationary 0-plan, 1-plan, and 2-plan
for the respective FT-planning tasks, under
4 While the motivation for the FT-PLAN-1-k fragment comes from its
excellent applicability in practice, it is obviously not the only
fragment of FT-planning to be so motivated. For instance, if some
operators model knowledge acquisition, i.e., sensing, then (part of
the) operator non-determinism will be due to primary operator
effects, and thus planning with a-primary models for a > 1 is not of
theoretical interest only.


Figure 2: k-plans for an inline example


a normative 1-primary exception model that maps the primary effects
of all actions to 0, and the exceptional effects of the
non-deterministic move actions to 1. In the triplet denotation
$[x, y, z]$ of the states, $x$ is the robot’s location,
$y \in \{ok, F\}$ is the status of the tire, and $z$ is the number
of spare tires in the robot’s possession. State representation in
this problem also addresses the availability of the spare tires at
$x_2$ and $x_3$, but we omit this information in the figure for
brevity. The dashed arrows depict possible effects of the actions
that the agent ignores at planning. The 0-plan is a classical plan,
and it is as simple as the example itself. The 1-plan is already
more involved: to guarantee reaching $x_5$ under a possibility of a
single fault, the robot picks up a spare tire at $x_3$. In turn, the
2-plan in Figure 2d prescribes that the robot first collect both
spare tires at $x_2$ and $x_3$, and only then start moving towards
$x_5$, replacing flat tires on the way, if needed.

Still, despite the structural complexity of k-plans, the
EXPTIME-hardness proof for FT-PLAN-2-k does not carry over to
FT-PLAN-1-k, and for a good reason: Theorem 2 below shows that
FT-PLAN-1-k is in PSPACE, that is, worst-case not harder than
classical planning. Moreover, Theorem 3 then shows even closer
resemblance between these two formalisms, namely that
POLY-FT-PLAN-1-k is in NP.


Theorem 2 FT-PLAN-1-k is in PSPACE.

A  non-deterministic  algorithm  (that  can  also  be  compiled 
to a Turing machine) for deciding whether there is a k-plan


Algorithm FT-PLAN-1-k(Π, F)

main
    Plan-FT(s_0, k)
    accept

procedure Plan-FT(s, k)
    steps ← 0
    while steps < 2^n do
        if s ⊨ G then return
        choose operator o s.t. s ⊨ pre(o)
        for e ∈ eff(o) do
            if F(e) > k
                then continue    // under assumed k, e cannot happen here
            else if F(e) > 0
                then Plan-FT(Res[s; e], k − F(e))
        // proceed with the primal effect of (pre, eff) at s
        steps ← steps + 1
        s ← Res[s; e]
    reject


Figure 3: PSPACE algorithm for deciding FT-PLAN-1-k.


for an FT-planning task $(\Pi, \mathcal{F}, k)$ with 1-primary
$\mathcal{F}$ and $|P| = n$ is depicted in Figure 3. The respective
Turing machine is in PSPACE because, at any point, there are at most
$k$ open calls to the Plan-FT procedure, each storing a single state
in $n$ bits and a single counter steps in $n$ bits. Finally, since
it is in PSPACE, FT-PLAN-1-k is also PSPACE-complete by the
PSPACE-hardness of classical planning under the description language
we use, and equivalence of the latter to FT-PLAN-1-0.
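
For intuition, the non-deterministic procedure of Figure 3 can be simulated deterministically by backtracking over the operator choices. The Python sketch below is such a simulation under stated assumptions: it takes exponential time (it is not a PSPACE procedure), it assumes a normative 1-primary exception model, and all names (`Effect`, `Operator`, `plan_ft`) as well as the toy task are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

State = FrozenSet[str]


class Effect(NamedTuple):
    add: FrozenSet[str]
    delete: FrozenSet[str]
    faults: int  # value of the exception model on this effect


class Operator(NamedTuple):
    name: str
    pre: FrozenSet[str]
    effects: Tuple[Effect, ...]


def res(s: State, e: Effect) -> State:
    return (s - e.delete) | e.add


def plan_ft(s: State, k: int, ops: List[Operator], goal: FrozenSet[str],
            n_props: int, steps: int = 0) -> bool:
    """True iff some sequence of operator choices makes Plan-FT(s, k) return."""
    if goal <= s:
        return True
    if steps >= 2 ** n_props:  # step bound of the while loop in Figure 3
        return False
    for o in ops:  # backtracking replaces the non-deterministic "choose"
        if not o.pre <= s:
            continue
        primary = [e for e in o.effects if e.faults == 0]
        if len(primary) != 1:
            continue  # the sketch assumes a normative 1-primary model
        # Every exceptional effect within the budget needs its own sub-plan.
        recoverable = all(
            plan_ft(res(s, e), k - e.faults, ops, goal, n_props)
            for e in o.effects
            if 0 < e.faults <= k
        )
        # Then proceed along the primary effect with the same budget.
        if recoverable and plan_ft(res(s, primary[0]), k, ops, goal,
                                   n_props, steps + 1):
            return True
    return False


# Toy task (an assumption for illustration): one spare tire, one risky move.
move = Operator("move", frozenset({"at_a", "noflat"}), (
    Effect(frozenset({"at_b"}), frozenset({"at_a"}), 0),
    Effect(frozenset(), frozenset({"noflat"}), 1),
))
fix = Operator("fix", frozenset({"at_a", "spare"}), (
    Effect(frozenset({"noflat"}), frozenset({"spare"}), 0),
))
init = frozenset({"at_a", "noflat", "spare"})
goal = frozenset({"at_b"})
solvable = {k: plan_ft(init, k, [move, fix], goal, n_props=4) for k in range(3)}
```

In this toy task a single spare tire suffices to tolerate one fault but not two, so the task is solvable for k = 0 and k = 1 and unsolvable for k = 2.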


Theorem 3 POLY-FT-PLAN-1-k is in NP.

The proof is by showing that, if $(\Pi, \mathcal{F}, k)$ has a
$k$-plan $\pi'$ such that all its $k$-admissible executions reach
the goal after $O(n^c)$ steps, then there is a $k$-plan $\pi$ with
the same property such that
$\|\pi\| = O(b^{k+1} n^{c(k+2)})$, where
$b = \max_{o \in O} |\mathrm{eff}(o)|$. Since $b = O(\|\Pi\|)$ and
both $c$ and $k$ are $O(1)$, the rest stems from the standard
guess-and-verify argument of NP-membership.

First, since $k$-plans guarantee goal reachability only along
executions with $k$ or fewer exceptions, let $\pi$ follow $\pi'$ on
states reachable by the $k$-admissible executions of $\pi'$, and
make a random operator choice everywhere else. Thus, only the
$k$-admissible executions of $\pi$ should be represented. Second, as
we upper-bound the description size of $\pi$, for simplicity we
assume (a) extensive, tree-structured representation of $\pi$, and
(b) that the range of $\mathcal{F}$ is $\{0, 1\}$. For
$0 \le i \le k$, let $f(\pi, i)$ be the number of $i$-admissible
executions of $\pi$ that are not $(i-1)$-admissible. Clearly,
$\sum_{i=0}^{k} f(\pi, i)$ is the overall number of $k$-admissible
executions, and thus
$\|\pi\| = O\left(n^c \sum_{i=0}^{k} f(\pi, i)\right)$. Since
$\mathcal{F}$ is 1-primary, there is at most one 0-admissible
execution of $\pi$ per possible initial state, and thus
$f(\pi, 0) = 1$. Recursively, due to the same argument of
$\mathcal{F}$ being 1-primary, each of the $O(n^c)$ operator
instances along that single 0-admissible execution may branch into
$b$ 1-admissible executions of $\pi$. However, these are the only
possible sources of 1-admissible executions. Thus,
$f(\pi, 1) = O(f(\pi, 0) \cdot b n^c) = O(b n^c)$, and in general,
$f(\pi, i) = O((b n^c)^i)$. Hence,
$\|\pi\| = O\left(n^c \cdot (b n^c)^{k+1}\right) = O(b^{k+1} n^{c(k+2)})$.
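
As a worked instance of the counting argument above, consider the smallest non-classical case $k = 1$, with the range of $\mathcal{F}$ being $\{0, 1\}$ as assumed in the proof. The derivation below only instantiates the quantities already defined in the text:

```latex
\begin{align*}
f(\pi, 0) &= 1, \\
f(\pi, 1) &= O\bigl(f(\pi, 0) \cdot b\,n^{c}\bigr) = O(b\,n^{c}), \\
\|\pi\| &= O\Bigl(n^{c} \bigl(f(\pi, 0) + f(\pi, 1)\bigr)\Bigr)
         = O\bigl(n^{c}(1 + b\,n^{c})\bigr)
         = O(b\,n^{2c}), \\
O(b\,n^{2c}) &\subseteq O\bigl(b^{2} n^{3c}\bigr)
         = O\bigl(b^{k+1} n^{c(k+2)}\bigr) \quad \text{for } k = 1.
\end{align*}
```

The direct count is tighter than the general bound, but the general bound is all that the guess-and-verify argument needs: it is polynomial in $\|\Pi\|$ whenever $c$ and $k$ are constants.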

Note that, while $k = O(1)$ should suffice for most interests in
practice, the PSPACE-membership result of Theorem 2 holds for
$k = O(\mathrm{poly}(\|\Pi\|))$. This is not so, however, with the
NP-membership result of Theorem 3, which relies upon $k = O(1)$.

Compilation  to  Classical  Planning 

Theorems  2  and  3  put  FT-planning  under  1-primary  excep¬ 
tion  models  rather  close  to  classical  planning.  On  the  one 
hand,  that  suggests  that  classical  planning  machinery  can 
possibly  be  adapted  to  solve  such  FT-planning  tasks.  On 
the other hand, the non-linearity of k-plans under 1-primary
models  seems  to  complicate  applying  classical  planning  al¬ 
gorithms  to  FT-planning. 

We  now  show  that  FT-planning  under  1-primary  mod¬ 
els  can  be  efficiently  compiled  into  classical  planning,  at 
least  as  long  as  the  number  of  non-deterministic  effects 
per  operator  is  bounded  by  a  constant.  The  compilation 
is  to  STRIPS  with  negative  preconditions  and  conditional 
effects. In this formalism, a planning task is given by a quadruple
$\Pi = (P, O, s_0, G)$, with $P$, $s_0$, and $G$ being as in our
formalism for contingent planning. Operators $o \in O$ are pairs
$(\mathrm{pre}(o), \mathrm{eff}(o))$ where the precondition
$\mathrm{pre}(o)$ is a subset of literals over $P$, and
$\mathrm{eff}(o)$ is a set of conditional effects. A conditional
effect $e$ is a triplet
$(\mathrm{con}(e), \mathrm{add}(e), \mathrm{del}(e))$ of condition,
add, and delete lists, respectively, where $\mathrm{con}(e)$ is a
subset of literals over $P$, while $\mathrm{add}(e)$ and
$\mathrm{del}(e)$ are subsets of propositions. Operator $o$ is
applicable in state $s$ iff $s \models \mathrm{pre}(o)$. If
$o \in O(s)$ is applied in $s$, it deterministically changes the
world to state
$\mathrm{Res}[s; o] = \bigcup_{e \in \mathrm{eff}(o),\; s \models \mathrm{con}(e)} (s \setminus \mathrm{del}(e)) \cup \mathrm{add}(e)$.

We start with a compilation of a simple fragment of FT-PLAN-1-k,
corresponding to FT-planning tasks
$(\Pi = (P, O, s_0, G), \mathcal{F}, k)$ such that (i) each operator
of $\Pi$ has at most two effects, and (ii) $\mathcal{F}$ is a
normative 1-primary exception model for $\Pi$ such that, for each
$o \in O$, if $\mathrm{eff}(o) = \{e_0\}$, then
$\mathcal{F}(e_0) = 0$, and if $\mathrm{eff}(o) = \{e_0, e_1\}$,
then $\mathcal{F}(e_0) = 0$ and $\mathcal{F}(e_1) = 1$. We begin
with an example that illustrates the basic idea behind this
compilation. Let $k = 2$, and let Figure 4a depict an irreducible
contingent plan $\pi$ for $\Pi^{(\mathcal{F},k)}$; the arcs
correspond to the operator effects and are labeled with the
respective values of $\mathcal{F}$, and double-frame states are the
goal states.

Note that, in Figure 4a, the states and operator instances in $\pi$
are numbered consistently with a DFS traversal of the execution tree
$T_\pi$. Therefore, the operator sequence $o_1, \dots, o_7$ induces
a sequence of policies $\pi_0, \dots, \pi_7$ for
$\Pi^{(\mathcal{F},k)}$ such that $\pi_0$ is an empty policy,
$\pi_7 = \pi$, and each $\pi_i$ extends $\pi_{i-1}$ with mapping a
single leaf of $T_{\pi_{i-1}}$ to operator $o_i$. An important
property of this sequence of policies $\pi_0, \dots, \pi_7$ is that,
for $0 \le j \le 2$, each $\pi_i$ induces at most one execution
$\rho$ with $\mathcal{F}(\rho) = j$ that does not achieve the goal
within $T_{\pi_i}$. The latter is emphasized by the tabular
representation of this sequence of policies in Figure 4b. The
columns in the table capture certain subsets
$\sigma_0, \dots, \sigma_7$ of leaves of
$T_{\pi_0}, \dots, T_{\pi_7}$, that is, of the end-states of
$\pi_0, \dots, \pi_7$, respectively. For

counterparts in $P_i$. For example,
$\lfloor \{p, \neg q\} \rfloor_i = \{p_i, \neg q_i\}$, and
$\lfloor \{p_i, \neg q_i\} \rfloor_j = \{p_j, \neg q_j\}$.

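Operators. $O'$ contains $k + 1$ sets of operators
$O_0, \dots, O_k$, with $O_i = \{o_i \mid o \in O\}$. For each
$o \in O$, the precondition of the operator $o_i$ is

$$\mathrm{pre}(o_i) = \lfloor \mathrm{pre}(o) \rfloor_i \cup \{\mathit{open}_i\} \cup \bigcup_{j=i+1}^{k} \{\neg \mathit{open}_j\}.$$

In other words, if state $\sigma$ of $\Pi'$ represents a policy for
$\Pi^{(\mathcal{F},k)}$ such that some admissible executions of that
policy do not reach the goal, then the planner is forced to extend
the $i$-admissible such execution with the highest index $i$. If
either $o$ is deterministic or $i = k$, then
$\mathrm{eff}(o_i) = \mathrm{Res}[\sigma/P_i; \lfloor e_0 \rfloor_i]$.
Otherwise, if $\mathrm{eff}(o) = \{e_0, e_1\}$ and $i < k$, then

$$\mathrm{eff}(o_i) = \mathrm{Res}[\sigma/P_i; \lfloor e_0 \rfloor_i] \wedge \lfloor \mathrm{Res}[\sigma/P_i; \lfloor e_1 \rfloor_i] \rfloor_{i+1} \wedge \mathit{open}_{i+1}.$$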


Figure 4: A plan $\pi$ for $\Pi^{(\mathcal{F},2)}$ and the end-state
set representation of the induced sequence of $\pi$’s sub-policies.

$0 \le j \le 2$, each $\sigma_i$ contains at most one state with
“$j$ failures so far,” denoted $\sigma_i(j)$, with
$\sigma_i(j) = \bot$ denoting that $\sigma_i$ does not contain a
state for $j$. For $i = 0$, $\sigma_0(0) = (s_0, 0)$, and
$\sigma_0(1) = \sigma_0(2) = \bot$. For $i > 0$, if $T_{\pi_i}$ has
a (unique) leaf $(s, j)$ such that $s \notin S_G$, then
$\sigma_i(j) = (s, j)$, and otherwise,
$\sigma_i(j) = \sigma_{i-1}(j)$. Asterisks in the table are by the
goal states, and note that the last set $\sigma_7$ is the first in
the sequence to contain only goal states.

It turns out that any irreducible $k$-plan for any FT-planning task
from the fragment in question induces such a DFS-ordered sequence of
sub-policies with “at most one non-goal leaf with $j$ failures so
far.” It is precisely this property that provides a basis for our
compilation of $(\Pi = (P, O, s_0, G), \mathcal{F}, k)$ to an
equivalent classical planning task
$\Pi' = (P', O', \sigma'_0, G')$. For now, we postpone the formal
statement and proof of this property, and formulate the actual
compilation. After that, in Lemma 5, we formally state the task
properties underlying the compilation, and use this lemma to specify
compilation for a wider class of FT-planning tasks.

Propositions. For clarity, we use letters $s$ and $\sigma$ to denote
states of $\Pi$ and $\Pi'$, respectively. The subsets
$\sigma_0, \dots, \sigma_7$ of $S^{(\mathcal{F},k)}$ in Figure 4b
hint at the state space structure of $\Pi'$: Each reachable state
$\sigma$ of $\Pi'$ represents a partial policy $\pi_\sigma$ for
$\Pi^{(\mathcal{F},k)}$ that corresponds to a concrete stage of a
certain DFS traversal of the purported tree-structured plan for
$\Pi^{(\mathcal{F},k)}$. To support that, $P'$ contains $k + 1$
replicas $P_0, \dots, P_k$ of $P$, as well as a set of auxiliary
propositions $\{\mathit{open}_i\}_{i=0}^{k}$. The interpretation of
$\mathit{open}_i \in \sigma$ is that “the policy for
$\Pi^{(\mathcal{F},k)}$ represented by $\sigma$ induces an
$i$-admissible execution that does not achieve the goal, and the
end-state of that execution is captured by the values of $P_i$ in
$\sigma$.” In what follows, by $\sigma/P_i$ we denote the valuation
provided by $\sigma$ to propositions $P_i$. Likewise, if $\phi$ is a
set of literals either over $P$ or over one of the proposition sets
$P_0, \dots, P_k$ of $\Pi'$, then, for $0 \le i \le k$,
$\lfloor \phi \rfloor_i$ is a set of literals over $P_i$, obtained
from $\phi$ by replacing all the propositional symbols with their


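In the formalism of our choice, such operator effects are captured as follows. Each conditional effect is written as a triple $\langle \mathrm{condition}, \mathrm{add}, \mathrm{delete} \rangle$. If $\mathrm{eff}(o) = \{e_0\}$ or $i = \kappa$, then $\mathrm{eff}(o_i)$ contains a single unconditional effect:

$$\mathrm{eff}(o_i) = \{\, \langle \emptyset, [\mathrm{add}(e_0)]_i, [\mathrm{del}(e_0)]_i \rangle \,\}.$$

Otherwise, if $\mathrm{eff}(o) = \{e_0, e_1\}$ and $i < \kappa$, then

$$\mathrm{eff}(o_i) = \left\{ \begin{array}{l} \langle \emptyset, [\mathrm{add}(e_0)]_i, [\mathrm{del}(e_0)]_i \rangle, \\ \langle \emptyset, \{open_{i+1}\}, \emptyset \rangle, \\ \langle \emptyset, [\mathrm{add}(e_1)]_{i+1}, [\mathrm{del}(e_1)]_{i+1} \rangle \end{array} \right\} \cup \bigcup_{p \notin \mathrm{add}(e_1) \cup \mathrm{del}(e_1)} \left\{ \begin{array}{l} \langle \{p_i\}, \{p_{i+1}\}, \emptyset \rangle, \\ \langle \{\neg p_i\}, \emptyset, \{p_{i+1}\} \rangle \end{array} \right\},$$

where the second component of the union compactly encodes the situation calculus frame axioms between the "current situation with $i$ failures so far" and the "next situation with $i + 1$ failures so far." In addition, $O'$ contains a set of auxiliary "goal-achieving" operators $\{o^*_i\}_{i=0}^{\kappa}$ with

$$\mathrm{pre}(o^*_i) = [G]_i \cup \{open_i\} \cup \bigcup_{j=i+1}^{\kappa} \{\neg open_j\},$$

$$\mathrm{eff}(o^*_i) = \{\, \langle \emptyset, \emptyset, \{open_i\} \rangle \,\}.$$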
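The effect construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `at` and `compile_effects`, the string encoding of replicas as `p_i`, and the encoding of a negative literal as `-p_i` are all assumptions made for this example. It covers only the basic case of at most two effects per operator.

```python
from typing import FrozenSet, Iterable, List, Tuple

Effect = Tuple[FrozenSet[str], FrozenSet[str]]                       # (add, del) over P
CondEffect = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]   # (condition, add, del)


def at(props: Iterable[str], i: int) -> FrozenSet[str]:
    """Rename every proposition p to its level-i replica 'p_i' (hypothetical naming)."""
    return frozenset(f"{p}_{i}" for p in props)


def compile_effects(props: Iterable[str], effects: List[Effect],
                    i: int, kappa: int) -> List[CondEffect]:
    """Conditional effects of the replica o_i of an operator with effects [e0] or [e0, e1]."""
    add0, del0 = effects[0]
    # The primary effect e0 is applied unconditionally at the current level i.
    result: List[CondEffect] = [(frozenset(), at(add0, i), at(del0, i))]
    if len(effects) == 1 or i == kappa:
        return result
    add1, del1 = effects[1]
    # Open the branch with one more failure and apply e1 on level i + 1.
    result.append((frozenset(), frozenset({f"open_{i + 1}"}), frozenset()))
    result.append((frozenset(), at(add1, i + 1), at(del1, i + 1)))
    # Frame axioms: copy every proposition untouched by e1 from level i to level i + 1.
    for p in sorted(set(props) - add1 - del1):
        result.append((frozenset({f"{p}_{i}"}), frozenset({f"{p}_{i + 1}"}), frozenset()))
        result.append((frozenset({f"-{p}_{i}"}), frozenset(), frozenset({f"{p}_{i + 1}"})))
    return result
```

For an operator with a primary effect deleting `x` and a failure effect deleting `noflat`, calling `compile_effects({"x", "noflat", "spare"}, [...], 0, 1)` yields seven conditional effects: the primary effect, the `open_1` marker, the failure effect on level 1, and two frame axioms each for `x` and `spare`.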


Initial state and goal. The $P_0$-part of the initial state $\sigma'_0$ captures the sole initial state of $\Pi$, and for $i > 0$, the $P_i$-parts of the initial state are actually not important, and can be set arbitrarily. The auxiliary variable $open_0$ is initially set to true, the rest of the variables $open_i$ are initially set to false, and the goal of $\Pi'$ is to negate $open_0$. That is, $\sigma'_0 = [s_0]_0 \cup \{open_0\}$ and $G' = \{\neg open_0\}$.

For an illustration, consider a small and simplified variant of our running example in which there are only two locations, $x$ and $not(x)$, the robot and a single spare tire are initially at $x$, and the goal is for the robot to be at $not(x)$. Movement of the robot either succeeds (which is the primary effect of that operator), or fails, with the robot staying at the original location with a flat tire. This planning task $\Pi$ is encoded using propositions $P = \{x, noflat, spare\}$ by the operators as in Table 1a, initial state $s_0 = \{x, spare, noflat\}$, and goal $G = \{\neg x\}$.

(a)
o    | pre         | eff                                                    | F
-----|-------------|--------------------------------------------------------|------------------------
move | {x, noflat} | e_0 = (∅, {x}); e_1 = (∅, {noflat})                    | F(e_0) = 0; F(e_1) = 1
fix  | {x, spare}  | e_0 = ({noflat}, {spare})                              | F(e_0) = 0

(b)
o      | pre                                | eff
-------|------------------------------------|--------------------------------------------------------------
move_0 | {x_0, noflat_0, open_0, ¬open_1}   | { ⟨∅, ∅, {x_0}⟩, ⟨∅, {open_1}, ∅⟩, ⟨∅, ∅, {noflat_1}⟩,
       |                                    |   ⟨{x_0}, {x_1}, ∅⟩, ⟨{¬x_0}, ∅, {x_1}⟩,
       |                                    |   ⟨{spare_0}, {spare_1}, ∅⟩, ⟨{¬spare_0}, ∅, {spare_1}⟩ }
move_1 | {x_1, noflat_1, open_1}            | { ⟨∅, ∅, {x_1}⟩ }
fix_0  | {x_0, spare_0, open_0, ¬open_1}    | { ⟨∅, {noflat_0}, {spare_0}⟩ }
fix_1  | {x_1, spare_1, open_1}             | { ⟨∅, {noflat_1}, {spare_1}⟩ }
o*_0   | {¬x_0, open_0, ¬open_1}            | { ⟨∅, ∅, {open_0}⟩ }
o*_1   | {¬x_1, open_1}                     | { ⟨∅, ∅, {open_1}⟩ }

Table 1: Operators from the compilation example.

The compilation $\Pi' = \langle P', O', \sigma'_0, G' \rangle$ of the FT-planning task $(\Pi, \mathcal{F}, 1)$ is defined over propositions $P' = \bigcup_{i \in \{0,1\}} \{x_i, noflat_i, spare_i, open_i\}$, and operators as in Table 1b. The initial state is $\sigma'_0 = \{x_0, spare_0, noflat_0, open_0\}$, and the goal is $G' = \{\neg open_0\}$. It is not hard to verify that $\pi = \langle move_0, fix_1, move_1, o^*_1, o^*_0 \rangle$ is (the only) plan for the classical planning task $\Pi'$, and that the respective contingent plan for $\Pi^{(\mathcal{F},1)}$ can be decoded from $\pi$ in linear time.
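The claim that this operator sequence is a plan for the compiled task can be checked mechanically. The sketch below hand-encodes the operators of Table 1b and simulates the plan; the helper names (`op`, `eff`, `apply`, `run`) and the flat proposition names such as `x0` are assumptions made for this illustration, and conditional effects are evaluated against the state in which the operator is applied.

```python
def op(pre_pos, pre_neg, effects):
    """An operator: positive preconditions, negative preconditions, conditional effects."""
    return (frozenset(pre_pos), frozenset(pre_neg), effects)


def eff(add=(), delete=(), if_true=(), if_false=()):
    """A conditional effect <condition, add, delete>; the condition is split by polarity."""
    return (frozenset(if_true), frozenset(if_false), frozenset(add), frozenset(delete))


OPS = {
    "move0": op({"x0", "noflat0", "open0"}, {"open1"}, [
        eff(delete={"x0"}),                       # primary effect on level 0
        eff(add={"open1"}),                       # open the one-failure branch
        eff(delete={"noflat1"}),                  # failure effect on level 1
        eff(add={"x1"}, if_true={"x0"}),          # frame axioms for x
        eff(delete={"x1"}, if_false={"x0"}),
        eff(add={"spare1"}, if_true={"spare0"}),  # frame axioms for spare
        eff(delete={"spare1"}, if_false={"spare0"}),
    ]),
    "move1": op({"x1", "noflat1", "open1"}, set(), [eff(delete={"x1"})]),
    "fix0": op({"x0", "spare0", "open0"}, {"open1"},
               [eff(add={"noflat0"}, delete={"spare0"})]),
    "fix1": op({"x1", "spare1", "open1"}, set(),
               [eff(add={"noflat1"}, delete={"spare1"})]),
    "goal0": op({"open0"}, {"x0", "open1"}, [eff(delete={"open0"})]),  # o*_0
    "goal1": op({"open1"}, {"x1"}, [eff(delete={"open1"})]),           # o*_1
}


def apply(state, name):
    """Apply one operator, raising ValueError if its precondition does not hold."""
    pre_pos, pre_neg, effects = OPS[name]
    if not pre_pos <= state or pre_neg & state:
        raise ValueError(f"{name} is not applicable")
    adds, dels = set(), set()
    for if_true, if_false, add, delete in effects:
        if if_true <= state and not (if_false & state):
            adds |= add
            dels |= delete
    return (state - dels) | adds


def run(plan, state):
    """Execute a sequence of operator names from the given state."""
    for name in plan:
        state = apply(state, name)
    return state


INIT = frozenset({"x0", "spare0", "noflat0", "open0"})
PLAN = ["move0", "fix1", "move1", "goal1", "goal0"]
```

Running `run(PLAN, INIT)` ends in a state without `open0`, which is the goal of the compiled task. Reordering the plan, for instance attempting `move1` directly after `move0`, fails because `noflat1` does not hold on the failure level until `fix1` is applied.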

In the spirit of this example, we have generated a set of tasks in which a robot should move from the bottom-left to the top-right corner of a 4-connected grid, in which some of the edges are "safe," with moves along them being deterministic, while other edges are "unsafe," with moves on them either succeeding (which is the primary, expected effect of that operator) or resulting in the robot getting a flat and staying where it was. A limited number of spare tires are available on some nodes of the grid. The six sets of five tasks each correspond to 5x5 and 7x7 grids, with each edge of the grid being independently marked as safe with probability $p \in \{0.1, 0.2, 0.5\}$, and 10 spare tires, independently positioned on the grid nodes at random.
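A generator for tasks of this kind could look roughly as follows. This is a sketch under assumptions: the function name `generate_grid_task`, the dictionary layout of the result, and the choice to place spare tires with replacement (so a node may hold several) are not taken from the source, which does not describe the generator beyond the paragraph above.

```python
import random


def generate_grid_task(size, p_safe, n_spares=10, seed=None):
    """Random grid task: each edge is safe with probability p_safe (hypothetical layout)."""
    rng = random.Random(seed)
    nodes = [(row, col) for row in range(size) for col in range(size)]
    safe = {}
    for row, col in nodes:
        # 4-connected grid: enumerate each undirected edge once (to the right and upward).
        for d_row, d_col in ((0, 1), (1, 0)):
            n_row, n_col = row + d_row, col + d_col
            if n_row < size and n_col < size:
                safe[((row, col), (n_row, n_col))] = rng.random() < p_safe
    # Spare tires are positioned independently, so several may land on one node.
    spares = [rng.choice(nodes) for _ in range(n_spares)]
    return {
        "start": (0, 0),                 # bottom-left corner
        "goal": (size - 1, size - 1),    # top-right corner
        "safe": safe,
        "spares": spares,
    }
```

A 5x5 grid has 40 undirected edges, so `generate_grid_task(5, 0.2, seed=1)["safe"]` has 40 entries; fixing the seed makes the generated task reproducible.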

The runtimes of different approaches on these tasks are depicted in Table 2. The three approaches we examined were (col. 2) contingent planning with Contingent-FF (Hoffmann and Brafman 2005); (col. 3-6) FT-planning with Contingent-FF over $(\mathcal{F}, \kappa)$-reformulations, $\kappa \in \{0, 1, 2, 4\}$; and (col. 7-10) FT-planning with Fast Downward's GBFS with FF heuristic over the classical planning reductions of the $(\mathcal{F}, \kappa)$-reformulations. Each task/planner was given a 10 minute time limit; cases in which the planner neither solved the task nor proved it unsolvable within the time bound are marked with "-". If a planner solved a task within the time bound, then the respective entry in the table is shaded.

           |      | CFF(Π^(F,κ))              | FD(Π')
task       | CFF  | κ=0  | κ=1  | κ=2  | κ=4  | κ=0  | κ=1   | κ=2    | κ=4
-----------|------|------|------|------|------|------|-------|--------|-------
5x5 (0.1)  | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.10  | 0.02   | 0.04
           | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.31  | -      | 0.13
           | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.10  | -      | 0.13
           | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.00  | 0.01   | 0.03
           | 0.08 | 0.00 | 4.99 | -    | -    | 0.00 | 0.01  | 0.01   | 0.03
5x5 (0.2)  | 0.08 | 0.00 | 0.13 | -    | -    | 0.00 | 0.01  | 0.03   | 0.06
           | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.19  | 0.02   | 0.04
           | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.01  | 0.01   | 0.03
           | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.00  | 0.01   | 0.03
           | 0.08 | 0.00 | 1.28 | 2.59 | -    | 0.00 | 0.70  | -      | 0.21
5x5 (0.5)  | 0.08 | 0.00 | -    | -    | -    | 0.00 | 7.50  | 0.02   | 0.04
           | 0.08 | 0.00 | 0.55 | 0.59 | 0.79 | 0.00 | 0.01  | 0.18   | 106.53
           | 0.08 | 0.00 | 0.12 | 5.04 | 5.43 | 0.00 | 0.01  | 0.02   | 0.04
           | 0.08 | 0.00 | -    | -    | -    | 0.00 | 0.01  | 310.59 | -
           | 0.07 | 0.00 | -    | -    | -    | 0.00 | 0.01  | 0.02   | 0.04
7x7 (0.1)  | 0.12 | 0.00 | -    | -    | -    | 0.00 | 0.02  | 0.03   | 0.06
           | 0.13 | 0.00 | 2.10 | -    | -    | 0.00 | 1.67  | 0.04   | 0.07
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.21  | 0.03   | 0.07
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.02  | 0.03   | 0.06
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.09  | 0.04   | 0.07
7x7 (0.2)  | 0.13 | 0.00 | -    | -    | -    | 0.00 | 27.32 | 0.04   | 0.08
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.01  | 0.03   | 0.06
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.02  | 0.03   | 0.06
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.01  | 0.03   | 0.06
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 5.96  | 0.04   | 0.07
7x7 (0.5)  | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.38  | 0.05   | 0.09
           | 0.13 | 0.00 | 3.32 | 4.13 | -    | 0.00 | 0.04  | 0.63   | 11.56
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.31  | 38.86  | -
           | 0.13 | 0.00 | 0.14 | 0.15 | 0.15 | 0.00 | 0.01  | 0.03   | 0.06
           | 0.13 | 0.00 | -    | -    | -    | 0.00 | 0.89  | 17.37  | 1.25

Table 2: Planner runtimes on different formulations of FT-planning tasks in the spirit of our example.

As Table 2 shows, all but one of these tasks were proven by Contingent-FF to have no strong contingent plans (and cyclic contingent plans are also of no help in this domain), while all (effectively classical) FT-planning tasks with $\kappa = 0$ were easily solved by both Contingent-FF and Fast Downward. For us, of course, the interesting part was in between these two extremes, and both Contingent-FF and Fast Downward found non-trivial $\kappa$-plans for numerous tasks here. In terms of performance, compiling the contingent $(\mathcal{F}, \kappa)$-reformulations to classical planning strictly dominated solving the former directly, in terms of the coverage of both solvable and unsolvable FT-planning tasks. In sum, the direction of compiling FT-planning tasks to classical planning appears promising, and clearly deserves further investigation.

In Lemma 4 below, we now formalize the properties of FT-PLAN-1-k that are exploited by the compilation of its fragment above. In particular, this lemma allows for extending this compilation scheme to arbitrary fixed bounds on the number of non-deterministic effects per operator, as well as to arbitrary normative 1-primary exception models.

Lemma 4 Let $(\Pi = \langle P, O, s_0, G \rangle, \mathcal{F}, \kappa)$ be an FT-planning task with a 1-primary model $\mathcal{F}$, and $\max_{o \in O} |\mathrm{eff}(o)| = b$. If $\pi$ is an irreducible contingent plan for $\Pi^{(\mathcal{F},\kappa)}$, then there exists a set of policies $\pi_0, \dots, \pi_m$ over $\Pi^{(\mathcal{F},\kappa)}$ such that

(1) $\pi_0$ is an empty policy, $\pi_m = \pi$, and each $\pi_i$ extends $\pi_{i-1}$ by prescribing an action for a single additional state of $\Pi^{(\mathcal{F},\kappa)}$ such that the execution tree $T_{\pi_{i-1}}$ is a proper sub-tree of $T_{\pi_i}$, and

(2) for $0 \le i \le m$ and $0 \le j \le \kappa$, $\pi_i$ induces at most $b$ executions $\rho$ that do not achieve the goal within $T_{\pi_i}$ and have $\mathcal{F}(\rho) = j$.

The proof is as follows. Let $\pi$ be an irreducible contingent plan for the $(\mathcal{F}, \kappa)$-reformulation of $(\Pi, \mathcal{F}, \kappa)$ as in the claim, and let $\{(s_1, k_1), \dots, (s_m, k_m)\}$ be a relabeling of the nodes $S_{\pi}^{(\mathcal{F},\kappa)}$ consistently with the order in which they are expanded by a depth-first traversal of $T_\pi$, with the "depth" of a node $(s, k)$ being given by $k$. Given that, let a sequence of policies $M = \pi_0, \dots, \pi_m$ be defined as

$$\pi_i(s_j, k_j) = \begin{cases} \pi(s_j, k_j), & j \le i \\ \text{undefined}, & j > i \end{cases}$$

It is immediate that $M$ satisfies condition (1) of the lemma, and so what remains to be shown is condition (2). The proof is by induction on $i$. For $i = 0$, the condition is trivially satisfied since $\pi_0$ is empty. Assuming that the condition is satisfied for $i - 1 \ge 0$, the proof for $i$ is as follows. By the DFS construction of $M$, we have $S_{\pi_i}^{(\mathcal{F},\kappa)} = S_{\pi_{i-1}}^{(\mathcal{F},\kappa)} \cup \{(s_i, k_i)\}$, where $(s_i, k_i)$ is a non-goal leaf node in $T_{\pi_{i-1}}$. Furthermore, for all other non-goal leaf nodes $(s_j, k_j)$ of $T_{\pi_{i-1}}$, it holds that $k_j \le k_i$, or otherwise DFS would expand $(s_j, k_j)$ prior to $(s_i, k_i)$. Given that, consider the extension of $T_{\pi_{i-1}}$ to $T_{\pi_i}$ by $\pi(s_i, k_i)$.

By the definition of exception models, for each $k < k_i$, the number of executions $\rho$ of $\pi_i$ that do not achieve the goal within $T_{\pi_i}$ and have $\mathcal{F}(\rho) = k$ is the same as for $\pi_{i-1}$, and this because the number of "exceptions so far" cannot decrease with the progress of the execution. For $k = k_i$, since $\mathcal{F}$ is 1-primary, $\pi_i$ replaces a single execution $\rho$ of $\pi_{i-1}$ that does not achieve the goal within $T_{\pi_{i-1}}$ and has $\mathcal{F}(\rho) = k_i$, with at most one such execution, namely the one that extends $\rho$ with the sole primary effect of $\pi(s_i, k_i)$. Finally, for all $k > k_i$, there are no executions of $\pi_{i-1}$ that do not achieve the goal within $T_{\pi_{i-1}}$ and have $\mathcal{F}(\rho) = k$, and thus there are at most $b$ such executions of $\pi_i$.

Given an FT-planning task $(\Pi = \langle P, O, s_0, G \rangle, \mathcal{F}, \kappa)$ with a normative 1-primary model $\mathcal{F}$ and $\max_{o \in O} |\mathrm{eff}(o)| = O(1)$, a polynomial-time, sound, and complete compilation of $(\Pi, \mathcal{F}, \kappa)$ to a classical planning task $\Pi' = \langle P', O', \sigma'_0, G' \rangle$ is specified below. For ease of presentation, we assume that for each operator $o \in O$, if $\mathrm{eff}(o) = \{e_0, e_1, \dots, e_\beta\}$, then $\mathcal{F}(e_0) = 0$. The set of propositions $P'$ contains $\kappa(b-1) + 1$ replicas of propositions $P$, as well as $\kappa(b-1) + 1$ auxiliary propositions $open$, denoted as

$$P' = P_{0,0} \cup \{open_{0,0}\} \cup \bigcup_{i=1}^{\kappa} \bigcup_{j=1}^{b-1} \big( P_{i,j} \cup \{open_{i,j}\} \big).$$

The set of operators $O'$ contains $\kappa(b-1) + 1$ sets of operators $O_{0,0}, O_{1,1}, \dots, O_{1,b-1}, \dots, O_{\kappa,1}, \dots, O_{\kappa,b-1}$, with $O_{i,j} = \{o_{i,j} \mid o \in O\}$. For each $o \in O$, the precondition of the operator $o_{i,j}$ is

$$\mathrm{pre}(o_{i,j}) = [\mathrm{pre}(o)]_{i,j} \cup \{open_{i,j}\} \cup \bigcup_{x=j+1}^{b-1} \{\neg open_{i,x}\} \cup \bigcup_{y=i+1}^{\kappa} \bigcup_{x=1}^{b-1} \{\neg open_{y,x}\}.$$

If $\mathrm{eff}(o) = \{e_0, e_1, \dots, e_\beta\}$, then

$$\mathrm{eff}(o_{i,j}) = \{\, \langle \emptyset, [\mathrm{add}(e_0)]_{i,j}, [\mathrm{del}(e_0)]_{i,j} \rangle \,\} \cup \bigcup_{\substack{1 \le x < b \\ i + \mathcal{F}(e_x) \le \kappa}} \left( \left\{ \begin{array}{l} \langle \emptyset, \{open_{i+\mathcal{F}(e_x),x}\}, \emptyset \rangle, \\ \langle \emptyset, [\mathrm{add}(e_x)]_{i+\mathcal{F}(e_x),x}, [\mathrm{del}(e_x)]_{i+\mathcal{F}(e_x),x} \rangle \end{array} \right\} \cup \bigcup_{p \notin \mathrm{add}(e_x) \cup \mathrm{del}(e_x)} \left\{ \begin{array}{l} \langle \{p_{i,j}\}, \{p_{i+\mathcal{F}(e_x),x}\}, \emptyset \rangle, \\ \langle \{\neg p_{i,j}\}, \emptyset, \{p_{i+\mathcal{F}(e_x),x}\} \rangle \end{array} \right\} \right).$$

Likewise, $O'$ contains a set of auxiliary operators $\{o^*_{0,0}, o^*_{1,1}, \dots, o^*_{1,b-1}, \dots, o^*_{\kappa,1}, \dots, o^*_{\kappa,b-1}\}$, where

$$\mathrm{pre}(o^*_{i,j}) = [G]_{i,j} \cup \{open_{i,j}\} \cup \bigcup_{x=j+1}^{b-1} \{\neg open_{i,x}\} \cup \bigcup_{y=i+1}^{\kappa} \bigcup_{x=1}^{b-1} \{\neg open_{y,x}\},$$

$$\mathrm{eff}(o^*_{i,j}) = \{\, \langle \emptyset, \emptyset, \{open_{i,j}\} \rangle \,\}.$$

Finally, as in the basic case, the initial state and goal are specified as $\sigma'_0 = [s_0]_{0,0} \cup \{open_{0,0}\}$ and $G' = \{\neg open_{0,0}\}$.

Theorem 5 Let $(\Pi = \langle P, O, s_0, G \rangle, \mathcal{F}, \kappa)$ be an FT-planning task with a 1-primary model $\mathcal{F}$, and $\Pi'$ be the compilation of $\Pi$. There is a bijective, efficiently computable mapping between the irreducible plans for $\Pi^{(\mathcal{F},\kappa)}$ and those for $\Pi'$.

Summary

We studied computational properties of fault tolerant planning, a simple and natural planning formalism that interpolates between contingent and classical planning. We showed that an important spot along this interpolation exhibits attractive worst-case time complexity, and for most, can be efficiently compiled to classical planning in a sound, complete, and practicable manner.

The palette of possible (and impossible) extensions to FT-planning that call for investigation is wide. For instance, it is possibly more natural in some contexts to assume bounds on the number of failures per operator, and not on the number of failures overall, as we do here. It is easy to show that this type of assumption-based planning is also in PSPACE, but our more practicable results here (NP-membership for problems with polynomially-long executions and the specific compilation scheme to classical planning) do not apply there. Also, for simplicity, throughout the paper we assumed a single (aka fully known a priori) initial state. It is not hard to verify, however, that our PSPACE-membership, NP-membership, and compilation results can be straightforwardly extended to arbitrary, polynomial, and fixed numbers of possible initial states, respectively, as long as listing these possible initial states does not introduce further complexity. Finally, we believe that FT-planning to classical planning compilation can be substantially stratified by exploiting the structure of the FT-planning tasks, similarly to the way the structure of the tasks is exploited in recent compilations from conformant to classical planning (Palacios and Geffner 2009).


Monte-Carlo Tree Search: To MC or to DP?


Zohar  Feldman  and  Carmel  Domshlak1 


Abstract. State-of-the-art Monte-Carlo tree search algorithms can be parametrized with either of the two information updating procedures: MC-backup and DP-backup. The dynamics of these two procedures is very different, and so far, their relative pros and cons have been poorly understood. Formally analyzing the dependency of MC- and DP-backups on various MDP parameters, we reveal numerous important issues that get hidden by the worst-case bounds on the algorithm performance, and reconfirm these findings by a systematic experimental test.

1  INTRODUCTION 

Markov decision processes (MDPs) are a standard model for planning under uncertainty [17]. An MDP $\langle S, A, Tr, R \rangle$ is defined by a set of states $S$, a set of state transforming actions $A$, a stochastic transition function $Tr : S \times A \times S \to [0, 1]$, and a reward function $R : S \times A \times S \to \mathbb{R}$. The states are fully observable and, in the finite horizon setting considered here, the rewards are accumulated over some predefined number of steps $H$. The objective of planning in MDPs is to sequentially choose actions so as to maximize the accumulated reward. The representation of large-scale MDPs can be either declarative or generative, but in either case it is concise and allows for simulated execution of all feasible action sequences, from any state of the MDP. In online MDP planning, the agent focuses on its current state $s_0$ only, deliberates about the set of possible policies from that state onwards and, when interrupted, chooses what action to perform next. In formal analysis of algorithms for online MDP planning, the quality of the action $a$, chosen for $s_0$ with $H$ steps-to-go, is assessed in terms of the induced "simple regret", capturing the performance loss that results from taking $a$ and then following an optimal policy $\pi^*$ for the remaining $H - 1$ steps, instead of following $\pi^*$ from the beginning [4].
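To make the notion of simple regret concrete, the following sketch computes it exactly for a small tabular finite-horizon MDP by backward induction. The dictionary representation `mdp[s][a] = [(probability, reward, next_state), ...]`, the function names, and the toy MDP are assumptions made for this example; an online planner would of course not have access to the exact values.

```python
def q_values(mdp, state, horizon):
    """Optimal Q-values of all actions at `state` with `horizon` steps-to-go."""
    if horizon == 0:
        return {}
    return {
        action: sum(prob * (reward + optimal_value(mdp, nxt, horizon - 1))
                    for prob, reward, nxt in outcomes)
        for action, outcomes in mdp[state].items()
    }


def optimal_value(mdp, state, horizon):
    """Value of following an optimal policy from `state` for `horizon` steps."""
    q = q_values(mdp, state, horizon)
    return max(q.values()) if q else 0.0


def simple_regret(mdp, state, horizon, action):
    """Loss from taking `action` first and acting optimally afterwards."""
    q = q_values(mdp, state, horizon)
    return max(q.values()) - q[action]


# A toy MDP (hypothetical): "risky" has the higher expected reward at s0.
TOY_MDP = {
    "s0": {
        "safe": [(1.0, 1.0, "s1")],
        "risky": [(0.5, 4.0, "s1"), (0.5, 0.0, "s1")],
    },
    "s1": {"stay": [(1.0, 0.5, "s1")]},
}
```

With two steps-to-go, `safe` is worth 1.5 and `risky` is worth 2.5 at `s0`, so recommending `safe` incurs a simple regret of 1.0 while recommending `risky` incurs none.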

Many popular algorithms for online MDP planning constitute what is called Monte-Carlo tree search (MCTS) [21, 16, 15, 7, 6, 19, 22, 14, 10, 12], and adaptations of some of these algorithms are also popular in other settings of sequential decision making, including those with partial state observability and adversarial effects [13, 20, 2, 8, 3]. At a high level, all MCTS algorithms explore the state-space region around $s_0$ by iteratively (i) simulating an action/state trajectory from $s_0$, and (ii) using the outcome of that trajectory to update various action-value estimates related to the state-space region of interest, as well as to update the estimate of what action should be best applied at state $s_0$. In that respect, specific MCTS algorithms differ both in their trajectory rollout strategies, as well as in their rollout-based update strategies.

Recent work substantially advanced our understanding of how the performance of MCTS depends on the specifics of the rollout strategy, as well as on the choice of what pieces of information should be updated based on a given rollout [15, 7, 9]. Recently, however, Keller & Helmert [14] demonstrated empirically that the performance of MCTS also depends to a large extent on how the respective updates are being performed. Prior to the work of Keller & Helmert, updates in MCTS algorithms were all based on MC-backups, that is, sample averaging updates of the selected action-value estimates. Keller & Helmert showed that modifying a standard MCTS algorithm (such as UCT [15]), by replacing its MC-backups with dynamic programming estimates propagation a la Bellman backups in value iteration [1], can substantially affect the performance, and, at least in their experiments, typically in favor of DP-backups. Later on, Feldman & Domshlak [12] showed that switching from MC-backups to such DP-backups preserves the order-of-magnitude convergence rates of MCTS instances that guarantee exponential rate performance improvement (such as BRUE [9]), and can even allow for proving somewhat better convergence bounds. Still, the relative pros and cons of MC- and DP-backups have not been systematically studied so far, and thus are still poorly understood.

This is precisely our contribution in this paper: Using BRUE and MaxBRUE, a pair of state-of-the-art MCTS algorithms that, ceteris paribus, use MC-backups and DP-backups, respectively, we study the dynamics of MC- and DP-backups both formally and empirically. Starting with establishing a pair of comparable worst-case bounds on the convergence rates of these two algorithms, we use the analysis behind these bounds to examine specific dependencies of the two algorithms on various MDP parameters, namely the state and action branching factors, the shape of the reward function, and the entropy of the transition function. To our knowledge, our analysis is the first of its kind, and it reveals numerous important issues that get hidden by the worst-case bounds due to certain deficiencies in the formal worst-case analysis of MC-backups. In particular, it suggests that MC-backups are less sensitive than DP-backups to the state branching factor, especially when the transition function concentrates on a small number of outcomes, as is typically the case in practical applications. The various aspects of the analysis are then put to a systematic experimental test, which reconfirms its key findings.

2  BACKGROUND 

In what follows, we adopt the notation and pseudo-code convention from [12]. In particular, when considering an MDP $\langle S, A, Tr, R \rangle$, its action and state branching factors are respectively denoted by $K = \max_s |A(s)|$ and $B = \max_{s,a} |\{s' \mid Tr(s' \mid s, a) > 0\}|$, $s\langle h \rangle$ denotes state $s \in S$ with $h$ steps-to-go, and $A(s) \subseteq A$ denotes the actions applicable in state $s$. Some auxiliary notation: The operation of drawing a sample from a distribution $\mathcal{D}$ over set $\aleph$ is denoted by $\sim \mathcal{D}[\aleph]$, $\mathcal{U}$ denotes uniform distribution, and $[\![n]\!]$ for $n \in \mathbb{N}$ denotes the set $\{1, \dots, n\}$. For a sequence of tuples $\rho$, $\rho[i]$ denotes the $i$-th tuple along $\rho$, and $\rho[i].x$ denotes the value of the field $x$ in that tuple.
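As a small illustration of the two branching factors, the sketch below computes $K$ and $B$ for an explicitly tabulated MDP. The representation `mdp[s][a] = [(probability, next_state), ...]` and the function name are assumptions of this example; in the online setting the MDP is typically given generatively and these quantities are not computed explicitly.

```python
def branching_factors(mdp):
    """Return (K, B): the maximal number of applicable actions per state and
    the maximal number of positive-probability outcomes per state/action pair."""
    K = max(len(actions) for actions in mdp.values())
    B = max(
        (len({nxt for prob, nxt in outcomes if prob > 0})
         for actions in mdp.values()
         for outcomes in actions.values()),
        default=0,
    )
    return K, B


# A toy MDP (hypothetical): state "u" is terminal and has no applicable actions.
EXAMPLE_MDP = {
    "s": {"a": [(0.5, "t"), (0.5, "u"), (0.0, "v")], "b": [(1.0, "t")]},
    "t": {"a": [(0.2, "s"), (0.3, "t"), (0.5, "u")]},
    "u": {},
}
```

Here `branching_factors(EXAMPLE_MDP)` gives `K = 2` (state `s` has two actions) and `B = 3` (action `a` at `t` has three outcomes with positive probability; the zero-probability outcome at `s` is not counted).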


(a) MCTS

MCTS: [input: ⟨S, A, Tr, R⟩; s_0 ∈ S]
    while time permits do
        ρ ← ROLLOUT
        UPDATE(ρ)
    return argmax_a Q(s_0⟨H⟩, a)

procedure ROLLOUT
    ρ ← ⟨⟩; s ← s_0; d ← 0
    while not STOP-ROLLOUT(ρ) do
        a ← ROLLOUT-ACTION(s⟨H − d⟩)
        s' ← ROLLOUT-OUTCOME(s⟨H − d⟩, a)
        r ← R(s, a, s')
        ρ[d] ← ⟨s, a, r, s'⟩
        s ← s'; d ← d + 1
    return ρ

(b) MC

procedure UPDATE(ρ)
    for d ← |ρ|, ..., 1 do
        h ← H − d
        ⟨s, a, r, s'⟩ ← ρ[d]
        n(s⟨h⟩) ← n(s⟨h⟩) + 1
        n(s⟨h⟩, a) ← n(s⟨h⟩, a) + 1
        n(s⟨h⟩, a, s') ← n(s⟨h⟩, a, s') + 1
        r̄ ← r + ESTIMATE(s'⟨h − 1⟩)
        MC-BACKUP(s⟨h⟩, a, r̄)

procedure ESTIMATE(s⟨h⟩)
    r̄ ← 0
    for d ← 0, ..., h − 1 do
        a ← EST-ACTION(s⟨h − d⟩)
        s' ← EST-OUTCOME(s⟨h − d⟩, a)
        r_{d+1} ← R(s, a, s')
        r̄ ← r̄ + r_{d+1}
        s ← s'
    return r̄

procedure MC-BACKUP(s⟨h⟩, a, r̄)
    Q(s⟨h⟩, a) ← ((n(s⟨h⟩, a) − 1) / n(s⟨h⟩, a)) · Q(s⟨h⟩, a) + (1 / n(s⟨h⟩, a)) · r̄

(c) DP

procedure UPDATE(ρ)
    for d ← |ρ|, ..., 1 do
        h ← H − d
        a ← ρ[d].a
        s' ← ρ[d].s'
        n(s⟨h⟩) ← n(s⟨h⟩) + 1
        n(s⟨h⟩, a) ← n(s⟨h⟩, a) + 1
        n(s⟨h⟩, a, s') ← n(s⟨h⟩, a, s') + 1
        R̂(s⟨h⟩, a) ← R̂(s⟨h⟩, a) + ρ[d].r
        DP-BACKUP(s⟨h⟩, a)

procedure DP-BACKUP(s⟨h⟩, a)
    Q(s⟨h⟩, a) ← R̂(s⟨h⟩, a) / n(s⟨h⟩, a)
    v ← 0
    for s' ∈ {s' | n(s⟨h⟩, a, s') > 0} do
        v ← v + (n(s⟨h⟩, a, s') / n(s⟨h⟩, a)) · max_{a'} Q(s'⟨h − 1⟩, a')
    Q(s⟨h⟩, a) ← Q(s⟨h⟩, a) + v

Figure 1. (a) Monte-Carlo tree search general scheme, with "separation of concerns" versions of (b) MC-backup and (c) DP-backup updates


MCTS, a canonical scheme underlying various MCTS algorithms for online MDP planning, is depicted in Figure 1. MCTS explores the state space in the radius of $H$ steps from the initial state $s_0$ by iteratively issuing simulated ROLLOUTs from $s_0$. Each such rollout $\rho$ comprises a sequence of simulated steps $\langle s, a, r, s' \rangle$, where $s$ is a state, $a$ is an action applicable in $s$, $r$ is an immediate reward collected from issuing the action $a$, and $s'$ is the resulting state. Once generated, the rollout is used to UPDATE some variables of interest, typically including at least the action value estimators $Q(s\langle h \rangle, a)$ and the counters $n(s\langle h \rangle, a)$ that record the number of times the corresponding estimators $Q(s\langle h \rangle, a)$ have been updated. Once interrupted, MCTS uses the information collected throughout the exploration to recommend an action to perform at state $s_0$.
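The general scheme can be sketched as a short Python loop with a pluggable update rule. This is a minimal sketch under assumptions, not the algorithm of any particular paper: the function names, the `sample(s, a, rng)` interface to the generative model, the fixed-length rollouts, the iteration budget standing in for "while time permits", and the 0-based indexing in which the step at depth `d` belongs to the node with `H - d` steps-to-go are all choices made for this example. The update shown averages the reward-to-go along the rollout, in the style of the earlier "update by the rollout" algorithms.

```python
import random
from collections import defaultdict


def mcts(s0, H, actions, sample, update, iterations, seed=0):
    """Generic MCTS loop: roll out from s0, update estimates, recommend an action."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(s, h, a)]: action value estimate at s with h steps-to-go
    n = defaultdict(int)    # n[(s, h, a)]: number of updates of that estimate
    for _ in range(iterations):
        rollout, s = [], s0
        for d in range(H):
            a = rng.choice(actions(s))     # uniform ROLLOUT-ACTION
            r, s_next = sample(s, a, rng)  # one call to the generative model
            rollout.append((s, H - d, a, r, s_next))
            s = s_next
        update(rollout, Q, n)
    return max(actions(s0), key=lambda a: Q[(s0, H, a)])


def reward_to_go_update(rollout, Q, n):
    """Average, for each step of the rollout, the reward accumulated from it onwards."""
    reward_to_go = 0.0
    for s, h, a, r, _ in reversed(rollout):
        reward_to_go += r
        n[(s, h, a)] += 1
        Q[(s, h, a)] += (reward_to_go - Q[(s, h, a)]) / n[(s, h, a)]
```

For a one-step problem in which action `a` always yields reward 1 and action `b` yields 0, `mcts(...)` with `reward_to_go_update` recommends `a` as soon as `a` has been sampled at least once.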

Instances of MCTS vary mostly along their ROLLOUT-ACTION policies, prescribing the action to apply in the current state of the rollout; and their UPDATE strategies, specifying (i) which of the maintained variables should be updated based on the rollout, as well as (ii) how those variables should be updated. The "how" aspect of the MCTS UPDATE procedures is our focus here. By decoupling the decisions of what to update and how to update, the emphasized text in Figures 1b and 1c shows the respective subroutines for MC-backup and DP-backup, the two alternatives for "how to update" that are in use these days by various MCTS algorithms.

•  MC-backups are based on the principle of averaging random variable samples: Given a new value sample $\bar{r}$ for an action $a$ at $s\langle h \rangle$, $\bar{r}$ updates the running sample average $Q(s\langle h \rangle, a)$, either knowing or just assuming that this way $Q(s\langle h \rangle, a)$ will eventually converge to the true Q-value of $a$ at $s\langle h \rangle$.

•  DP-backups  implement  dynamic  programming  style  estimates 
propagation,  resembling  Bellman  backups  in  value  iteration.  With 
DP-backups,  action  value  estimates  Q(s(h),a)  are  updated  by  the 
weighted  sum  of  the  value  estimates  of  the  empirically  best  ac¬ 
tions  at  the  outcomes  of  a  (discovered  so  far),  with  the  weights 
being  induced  by  the  gradually  learned  parameters  of  the  MDP’s 
stochastic  transition  function. 
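The two backup rules described in the bullets above can be contrasted in a short sketch. It is illustrative only: the `Stats` container, its field names, and the convention that unexplored successor nodes contribute a value of 0 are assumptions of this example rather than details taken from the source.

```python
from collections import defaultdict


class Stats:
    """Counters and estimates kept per node s<h>, keyed by tuples (hypothetical layout)."""

    def __init__(self):
        self.Q = defaultdict(float)      # Q[(s, h, a)]
        self.n_sa = defaultdict(int)     # n(s<h>, a)
        self.n_sas = defaultdict(int)    # n(s<h>, a, s')
        self.r_sum = defaultdict(float)  # accumulated immediate rewards of (s<h>, a)
        self.succ = defaultdict(set)     # outcomes of (s<h>, a) discovered so far
        self.acts = defaultdict(set)     # actions tried so far at s<h>


def mc_backup(st, s, h, a, sample_value):
    """MC-backup: fold a new value sample into the running sample average."""
    key = (s, h, a)
    st.n_sa[key] += 1
    st.Q[key] += (sample_value - st.Q[key]) / st.n_sa[key]


def dp_backup(st, s, h, a, r, s_next):
    """DP-backup: mean immediate reward plus the empirically weighted value of
    the empirically best action at each outcome discovered so far."""
    key = (s, h, a)
    st.n_sa[key] += 1
    st.n_sas[(s, h, a, s_next)] += 1
    st.r_sum[key] += r
    st.succ[key].add(s_next)
    st.acts[(s, h)].add(a)
    q = st.r_sum[key] / st.n_sa[key]
    for nxt in st.succ[key]:
        weight = st.n_sas[(s, h, a, nxt)] / st.n_sa[key]
        best = max((st.Q[(nxt, h - 1, b)] for b in st.acts[(nxt, h - 1)]), default=0.0)
        q += weight * best
    st.Q[key] = q
```

The MC-backup only needs the sampled value, whereas the DP-backup recomputes the estimate from the learned transition frequencies, so a single change in the estimates of a successor node affects the parent estimate on its next update.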

Earlier MCTS algorithms, such as flat MC, ε-greedy, UCT, and their numerous variations [3], all reflected rather directly the algorithms for reinforcement learning-while-acting in multi-armed bandit problems (MAB) [18]: Given a rollout $\rho$, update (the selected) action value estimates "by $\rho$", that is, by the actual rewards obtained along the rollout. Recent works on MCTS algorithms for online MDP planning examined the important differences between the (single state) MABs and (multi-state) general MDPs, leading to what was baptized as the principle of separation of concerns [9]: Instead of updating "by $\rho$", update (the selected) action value estimates "along the trajectory of $\rho$" by some information that goes beyond, and possibly even has nothing to do with, the specific rewards achieved by $\rho$. In particular, one can use one of the following.

1. MC-updates along additional rollouts, issued from the states along $\rho$ according to a special, update-oriented, "estimation" policy. Such an UPDATE procedure in particular gives rise to the BRUE algorithm [9], and it is depicted in Figure 1b, together with a general template for its ESTIMATE policy.

2. DP-updates along the trajectory of $\rho$, as depicted in Figure 1c. This procedure in particular gives rise to the MaxUCT [14] and MaxBRUE [12] algorithms.

3  WORST-CASE  GUARANTEES  VS. 

REALISTIC  EXPECTATIONS 

Our comparison between MC-backups and DP-backups in MCTS is carried through two particular MCTS algorithms, BRUE [9] and MaxBRUE [12], which guarantee exponential-rate reduction of simple regret. The only difference between BRUE and MaxBRUE is that the former employs MC-backups while the latter employs DP-backups. Hence, for ease of presentation, in what follows we refer to these two algorithms as MC and DP, respectively. Both MC and DP use uniform sampling for ROLLOUT-ACTION, and both use the same ROLLOUT-OUTCOME that samples the provided generative model of the action's transition function. With UPDATE as in Figure 1c, that basically concludes the definition of DP. The UPDATE procedure of MC in Figure 1b, and in particular, its ESTIMATE subroutine, needs one more choice to be made.

In analogy to DP-backup that propagates the value of the empirically best actions, MC in ESTIMATE makes the estimation rollouts along the empirically best actions (selected by EST-ACTION). Thus, in particular, no implementation choices are left open here. In contrast, for outcome selection along the estimation rollouts, two options are plausible, and both are viable in terms of consistency and performance guarantees. One option is to use the generative model of the action's transition function, same as in ROLLOUT-OUTCOME, while another option is to estimate the transition probabilities, similarly to DP, and draw samples from that empirical distribution.

The  advantage  of  the  second  scheme  is  that  the  number  of  “oracle 
calls”  to  the  generative  model  is  similar  to  that  of  DP.  In  contrast,  the 
first  scheme  performs  a  factor  of  ^  more  calls  to  generative  model. 
In  applications  where  such  oracle  calls  are  expensive,  due  to,  e.g.,  a 
need  to  simulate  a  complex  physical  model,  this  can  be  an  important 
argument  for  using  the  second  scheme.  However,  the  advantage  of 
the  first  scheme  is  that  it  is  not  affected  by  the  errors  in  the  estimation 
of the transition probabilities. In what follows, whenever we need to distinguish between the two options, we will use MC_m to refer to MC with EST-OUTCOME using the generative model, and MC_p to refer to MC with EST-OUTCOME using the estimated transition probabilities.
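The difference between the two outcome-selection options can be sketched as follows. The function names, the `model(s, a, rng)` interface, and the nested-dictionary layout of the transition counts are assumptions made for this illustration.

```python
import random


def outcome_from_model(model, s, a, rng):
    """Generative-model option: every estimation step costs one oracle call."""
    return model(s, a, rng)


def outcome_from_counts(counts, s, a, rng):
    """Estimated-probabilities option: draw s' with probability n(s, a, s') / n(s, a),
    using only the transition counts gathered by earlier rollouts."""
    successors = counts[(s, a)]  # dict: next state -> number of observations
    states = list(successors)
    weights = [successors[t] for t in states]
    return rng.choices(states, weights=weights, k=1)[0]
```

With counts `{("s", "a"): {"t": 3, "u": 1}}`, the second function returns `t` about three times out of four and never calls the generative model, which is what makes it attractive when oracle calls are expensive; the price is that errors in the counts propagate into the estimates.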

As we already mentioned, both DP and MC have been recently proven to reduce simple regret at exponential rate. However, the corresponding statements of the formal bounds in [9] and [12] are somewhat involved, and this complicates the comparative analysis and discussion of DP and MC that we want to make here. Propositions 1 and 2 below provide much more accessible formal bounds on the performance of DP and MC, simplified by assuming $B, K \gg 0$, which allows keeping track only of the highest order factors of $B$ and $K$, and replacing uniform ROLLOUT-ACTION with round-robin, which is equivalent in expectation but allows for simplifying the bounds further. We note that, while the bound for DP in Proposition 1 is qualitatively similar to that in [12], the bound for MC in Proposition 2 actually improves over the result in [9]. The proofs are delegated to a technical report [11].

Proposition 1 Let $\pi_B(s_0\langle H \rangle)$ be the action recommendation of DP after applying $n$ iterations. Then,

P ^Q{so(H),tt*(so(H)))  -  Q{s0(H),nB{s0(H)))  >  <*} 

_ 

<  K(2BK)H~1e  iK(BK)H-iH2 

Proposition 2 Let $\pi_B(s_0\langle H \rangle)$ be the action recommendation of MC after applying $n$ iterations. Then,


P  ^Q(so{H) ,ir* (so{H)))  -  Q{so{H),7tB(s0{H}))  >  i} 


< 


(6B2K 
V  A2 


(9BIv)^if_1)2(iT  —  l)!2e  2 k(sbk)b  i h? 


Roughly speaking, the exponents in the bounds in Propositions 1 and 2 capture the reduction rate of the simple regret, while the multiplicative factors capture the length of the "cooling periods" after which the respective bounds become meaningful. In that respect, the convergence rates of DP and MC appear to be rather comparable, excluding numerical constants, while the "cooling period" of MC appears to be much longer than that of DP. The latter suggests, even if only informally, that the empirical performance of DP should be expected to be more attractive than the empirical performance of MC. However, a deeper inspection of MC and DP below suggests a different perspective on the relative attractiveness of these two algorithms, and more generally, on the relative attractiveness of MC-backups and DP-backups in MCTS.


First, in [9] it was shown that the formal guarantees of MC can be
improved by basing the action value estimators only on a fraction α
of the most recent samples. This enhancement was referred to in [9] as
“learning by forgetting”. For some specific values of α convenient for
our discussion here, the bound for MC from Proposition 2 translates
to a bound for the “learning by forgetting” MC(α) as in Proposition 3
below.


Proposition 3 Let π_B(s_0(H)) be the action recommendation of
MC(α) after applying n iterations with a steps-to-go-dependent
averaging fraction α_h ~ . Then, we have
As it appears, the worst-case cooling period of MC(α) improves
on that of MC, and this seems to come at no cost in terms of the
convergence rate, expressed by the exponent. However, as we
explain below, it seems that this improvement of the bound in
Proposition 3 should be attributed mostly to the looseness of the bound for
the standard setup of MC, and much less to the actual improvement
of the performance measures. Indeed, adopting “learning by
forgetting” leads to only minor empirical improvement, if at all.²
Nevertheless, in comparison to DP, the cooling period length of MC(α)
has a stronger dependency on the accuracy level δ and the horizon H.
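
To make the “learning by forgetting” estimator concrete, the following Python sketch averages only the most recent fraction α of the samples collected for an action. It is a minimal illustration under stated assumptions: the function name `forgetting_estimate` and the rounding rule (the ceiling of α times the number of samples, and at least one sample) are hypothetical choices made here, not details taken from [9].

```python
import math
from typing import Sequence


def forgetting_estimate(samples: Sequence[float], alpha: float) -> float:
    """Average the most recent ceil(alpha * n) samples of an action.

    The rounding rule is an assumption of this sketch.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # Number of most recent samples that enter the estimate.
    k = max(1, math.ceil(alpha * len(samples)))
    recent = samples[-k:]
    return sum(recent) / len(recent)
```

With alpha = 1.0 the sketch reduces to the plain average over all samples, which corresponds to the standard setup of MC.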

At this point, two things should be noted with regard to the above
formal bounds. First, by definition, formal bounds capture the
worst-case settings of the MDP parameters, that is, uniform transition
probability functions, tree-structured state space, etc. As such, the bounds
tend to blur certain advantages of one algorithm over another in
solving MDPs with some specific (and possibly expected in
practice) characteristics. Second, due to conceptual differences between
the dynamics of MC- and DP-backups, the derivation of the bounds in
Propositions 1 and 2 is based on two very different types of
analysis. Hence, unlike what often happens for conceptually close
techniques [5, 9], the value of formal bounds as indicators for the
relative attractiveness of MC and DP is questionable. Having these two
reservations in mind, in what follows we provide a more conceptual
(i.e., less mathematically specific) comparative analysis of MC and
DP by exploring several key features of MDP models, and reasoning
about the (possibly different) effects of each of these features on the
performance of the two algorithms.


Branching factors B and K (The size of the problem) In both
Proposition 1 and Proposition 2, the basis of the analysis that gives
rise to the formal guarantees is the fact that identifying the optimal
action a* at the root node s_0(H) requires that

(1) the value of a* at s_0(H) is not too underestimated, and that

(2) the values of all the other, sub-optimal actions at s_0(H) are not
too overestimated.

Both MC and DP ensure that these accuracy requirements are met,
yet they differ in the way that these requirements recursively translate
into requirements on the descendants of s_0(H).

In DP, to ensure that a sub-optimal action a is not too
overestimated, all the applicable actions in all of the outcome states of a
must not be too overestimated. Thus, the accuracy of estimating a
sub-optimal action a in DP translates to accuracy requirements
being posed to all the possible BK immediate action successors of a.
In contrast, in MC, the likelihood that a sub-optimal action a will be
too overestimated is negligible, because the expected value
of the samples that induce the estimate of a is upper-bounded by
the true value of a, regardless of the estimates of the action
successors of a. Thus, the accuracy of estimating a sub-optimal action a
in MC translates to no accuracy requirements on the successors
of a. In sum, in terms of “not overestimating sub-optimal actions”,
MC-backups seem to be clearly preferred to DP-backups.

² This was observed both in our experiments for this work, as well as in [9].

Examining now the requirement of not underestimating the
optimal action a* too much, meeting this requirement in DP requires
that the optimal actions at all of the outcome states of a* are not too
underestimated. Indeed, if the latter holds, then the maximal action
values propagated to a* from its outcome states are also not too
underestimated. Thus, the requirement of “not too underestimating the
optimal action” a* at s_0(H) translates in DP into B similar
requirements being posed to all of the state successors of s_0(H) via a*.

When it comes to MC, the picture is somewhat more complicated.
Not underestimating the optimal action a* too much requires that, in
expectation, each of the samples inducing the estimate Q(s_0(H), a*)
does not underestimate the true value of a* too much. This implies
that all of the outcome states of a* should “identify” their optimal
actions, which in turn translates into accuracy requirements posed to all
(both optimal and sub-optimal) actions applicable at these outcome
states of a*. As we just mentioned, the accuracy requirements on
sub-optimal actions in MC are negligible, and thus the effective
burden is only with the accuracy requirements on the optimal actions
at the outcome states.

In sum, the requirement of not underestimating the optimal action
translates to accuracy requirements on the optimal actions at the
outcome states, at different stages of planning. In the absence of more
effective proof methods, the accuracy of each estimation sample in the
proof of Proposition 2 is considered in isolation, and this is precisely
the point where the difference between the bound and the actual
performance may inflate. Indeed, the accuracy of an action estimate
correlates with the accuracy of the same action estimate at subsequent
points in time, yet this correlation is not factored into the bounds.

Importantly, the analysis of MC(α) in that respect is not any
different: partial averaging does not offer a way to factor in this
correlation, but only reduces the bound by imposing accuracy requirements
on fewer samples. Clearly, as the number of iterations increases, the
correlation increases as well. The question, however, is when we can
expect to have higher correlation at earlier stages. For instance, when
the probability mass of the transition function concentrates on a small
number of outcomes, the effective action branching becomes smaller
than the nominal action branching B. In such cases, one should
expect to have more samples based on overlapping rollouts, and thus to
have a higher correlation. In any case, when the correlation is high,
MC becomes equivalent to DP in terms of the accuracy requirement
on the optimal action.

In summary, the dependency of DP on B and K is of order
(BK)^H, whereas, depending on the correlation, the dependency of
MC on B and K can be of order as small as B^H. Therefore, MC can
be expected to be less sensitive than DP to K, especially when the
effective action branching is relatively low, and thus the correlation
is relatively high.
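
To get a feel for the gap between these two orders of growth, the following Python sketch evaluates (BK)^H and B^H for the values H = 10, K = 20, and B = 20 used in the base setup of the experimental study. The function names are hypothetical, and the numbers only illustrate the orders of magnitude discussed above; they are not bounds on the actual number of iterations needed by either algorithm.

```python
def dp_order(B: int, K: int, H: int) -> int:
    """Order of the dependency of DP on B and K, that is, (BK)^H."""
    return (B * K) ** H


def mc_best_case_order(B: int, H: int) -> int:
    """Order of the dependency of MC under high correlation, that is, B^H."""
    return B ** H


if __name__ == "__main__":
    B, K, H = 20, 20, 10
    # The ratio of the two orders equals K ** H, independently of B.
    print(dp_order(B, K, H) // mc_best_case_order(B, H))
```

The ratio between the two orders depends only on K and H, which is one way to read the statement that MC can be expected to be less sensitive than DP to K.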

The shape of the reward function If, right from the first steps, the
immediate rewards of the optimal actions appear more attractive than
those of the sub-optimal actions, then identifying the optimal action
a* at s_0(H) is a somewhat simpler problem. The more challenging
cases in that respect are when the discriminative rewards are pushed
down the search tree, similarly to what happens in goal-driven MDPs.
In such cases, identifying the optimal action a* at s_0(H) requires
properly identifying optimal actions far from the root, where samples
are much sparser. Relating this point to the previous discussion on the
size of the problem, it can be expected that the advantage of MC in
the more challenging cases becomes more dependent on the effective
action branching being relatively low.

The entropy of the transition function For all the bounds, the
factor in the exponent results from the worst-case transition
probability function, which, for each action, induces a uniform
distribution over its outcomes. Clearly, as the entropy of the transition
functions decreases, the bounds and the performance of both
DP and MC improve. However, here as well, some differences are
expected depending on the update scheme. Since DP and MCp use
the estimated transition function in their UPDATE, their value
estimations would be skewed towards the value of the more probable
outcomes. Although this skew decreases with the number of samples,
this decrease is slower at the deeper nodes since they are sampled
less frequently. MCm, on the other hand, is free from this type of
inaccuracy and therefore has a certain advantage in that respect.

4  EXPERIMENTAL  STUDY 

In what follows, we put the qualitative comparison above to an
empirical test. In previous work, the empirical effectiveness of online
MDP planning algorithms was typically examined on a set of
specific MDP problems, such as the benchmark suites of planning
competitions (IPPC). These benchmarks, however, are problematic to use
if one wants to examine the marginal impact of various parameters
of the MDPs on the effectiveness of the algorithms, because these
parameters simply cannot be controlled. In fact, almost all of these
benchmarks are too large to compute the actual value of different
actions at a state, and without that, assessing the simple regret of different
algorithms is impossible. Taking that on board, we devised a
parametric MDP model from which one can select MDP instances with
(i) an arbitrary set of action values at the initial state, and (ii) an arbitrary
setting of the parameters discussed in the previous section. This
allows us to experiment with large MDPs, for which otherwise it would
be impossible (in reasonable time and computational resources) to
compute the value function, and based on it, assess the simple regret of
different algorithms.

For ease of presentation, in what follows we refer to nodes s(h)
simply by s; the steps-to-go component of the nodes remains clear
from the text. In our base MDP setup, the horizon is set to H = 10,
there are exactly K = 20 actions applicable at every node, and each
node/action pair induces exactly B equiprobable outcome nodes,
exclusive to that node/action pair, i.e., the induced state space is
tree-structured, and the transition probability functions are all uniform.
For a node s, an applicable action a, and a possible outcome s', the
immediate  reward  is  set  to  R(s,  a,  s')  =  ®(s’a> ,  and  the  value  of  the 
outcome  node  receives  the  remainder  V ( s' )  =  Q(s ,  a)  —  R(s,  a,  s'). 
At any node s, there is exactly one optimal action, the value of which
equals the value V(s) of the node, whereas all other actions have
identical values of εV(s), for some ε ∈ (0, 1). The choice of ε plays
an important role here. If, for instance, ε is set equally for all nodes,
then basing the value updates in both MC and DP on random (and
not empirically best) action successors will surface the optimal
action at s_0, because all the actions will be underestimated by
a similar magnitude. Therefore, in our setup, for all nodes reachable
by the optimal policy from s_0, we set ε = 0.6, while for all nodes
off the optimal policy, we set ε = 0.8. The only node that is not
properly covered by this categorization is the actual initial node s_0,
and there we also set ε = 0.8. In this setup, the estimates induced
by random updates would not preserve the right order of the action
values, imposing a harder challenge on the algorithms.

Figure 2. Experimental results on the base setup with K = 20, B = 20, and all action outcomes being equiprobable (top-center), as well as on the variants
with (↙) K = 200, (↘) B = 200, (↓) K = 200 and B = 200, (←) “good likely” transition functions, and (→) “bad likely” transition functions.
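
The assignment of action values at a single node of the parametric model can be sketched as follows. This is only an illustration of the rule stated above (one optimal action with the value of the node, all other actions scaled by the same factor in (0, 1)); the function name `action_values` and its arguments are hypothetical, and the sketch does not cover how the rewards and the outcome values are derived from the action values.

```python
from typing import List


def action_values(node_value: float, num_actions: int,
                  optimal_index: int, eps: float) -> List[float]:
    """Action values at one node of the parametric model.

    The optimal action gets the node value V(s); every other action gets
    eps * V(s). Names and signature are assumptions of this sketch.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if not 0 <= optimal_index < num_actions:
        raise ValueError("optimal_index out of range")
    return [node_value if a == optimal_index else eps * node_value
            for a in range(num_actions)]
```

In the setup described above, such a function would be called with eps = 0.6 for nodes reachable by the optimal policy, and with eps = 0.8 for the initial node and for all nodes off the optimal policy.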

The emphasized plot in the top-center of Figure 2 depicts the
simple regret obtained by the three examined algorithms, DP, MCp,
and MCm, on the base setup as above, as a function of the
number of iterations. While MC here appears slightly better at the start,
DP quickly catches up and gradually outperforms MC. However,
the picture changes when the base setup is modified in several
different ways. First, when scaling up the problem by increasing K
to 200 (bottom-left), or by increasing B to 200 (bottom-right), or
both (bottom-center), MC performs much better than DP at all times,
with MCm being the clear dominator. Second, the top-left and
top-right plots in Figure 2 depict the results for two setups that deviate
from the base only by altering the entropy of the transition
probability functions. In both setups, for each node s and each
applicable action a, one outcome s' is substantially more likely than all
others, with P(s' | s, a) = 0.9 and, for all outcomes s'' ≠ s',
P(s'' | s, a) = 0.1/(B − 1). The difference between the two setups is the
relative value of the more likely outcome s': in the “good likely”
setup (GL), the more likely outcome is also more valuable, i.e.,
R(s, a, s') + V(s') > R(s, a, s'') + V(s'') for all s'' ≠ s', and in
the “bad likely” setup (BL), it is the other way around. In either case,
all action-outcome values are set such that (1) they reflect the value
of the action, i.e., Σ_{s'} P(s' | s, a) (R(s, a, s') + V(s')) = Q(s, a),
and (2) each action-outcome value is neither smaller than half of the
action value, nor higher than the maximal immediate reward (= 1, in
our experiments) times the number of steps-to-go.
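
The skewed transition functions of the “good likely” and “bad likely” setups, and the requirement that the action-outcome values reflect the value of the action, can be sketched in Python as follows. The sketch assumes that the probability mass left after the likely outcome is split evenly among the remaining outcomes; the function names are hypothetical, and the way in which concrete outcome values are chosen in the experiments is not reproduced here.

```python
from typing import List, Sequence


def skewed_outcome_probabilities(num_outcomes: int, likely_index: int,
                                 p_likely: float = 0.9) -> List[float]:
    """One outcome gets p_likely; the rest share 1 - p_likely evenly."""
    if num_outcomes < 2:
        raise ValueError("at least two outcomes are required")
    rest = (1.0 - p_likely) / (num_outcomes - 1)
    return [p_likely if i == likely_index else rest
            for i in range(num_outcomes)]


def expected_action_value(probs: Sequence[float],
                          outcome_values: Sequence[float]) -> float:
    """Sum of P(s'|s,a) * (R(s,a,s') + V(s')) over the outcomes s'.

    Each entry of outcome_values stands for R(s,a,s') + V(s').
    """
    return sum(p * v for p, v in zip(probs, outcome_values))
```

Condition (1) above then amounts to checking that `expected_action_value` returns Q(s, a) for the chosen action-outcome values.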

In both “good likely” and “bad likely” variants, the reduction in
the entropy of the transition function basically reduces the effective
state branching of the actions, and thus the correlation between the
successive samples in MC is expected to grow, getting more in line
with the optimistic assumption on MC’s dependence on B and K.
The results depicted in Figure 2 support this expectation. Moving
from the base setup, the performance of MC improves in both “good
likely” and “bad likely” setups, and in fact, in both setups, MC
outperforms DP. Likewise, importantly, while in “good likely” there is
effectively no difference between MCm and MCp, in “bad likely”,
MCm performs much better than MCp, with the latter matching
the very poor performance of DP. Basically, the “bad likely” setup
demonstrates how dramatic the implications of basing
value estimation on the estimated transition probabilities can be. Here, the
underestimation of the values by DP and MCp results in their very
poor performance. To recap Figure 2, it appears that MC outperforms
DP except on MDPs of relatively small size, with MCm being
justifiably more robust than MCp.

Another  important  aspect  that  we  examined  in  our  experiments 
pertains  to  the  dependence  of  the  algorithm  performance  on  the  shape 
of  the  rewards  as  a  function  of  the  node  depth.  In  the  base  setup, 
at  any  node,  the  actions  are  rewarded  proportionally  to  their  actual 
value,  and  thus,  in  particular,  optimal  actions  have  higher  immediate 
rewards  than  the  sub-optimal  actions. 

Figure 3 shows the results for two setups that deviate from the base
setup in that aspect as follows. (Both these setups are more
challenging than the base, and thus the x-axis in Figure 3 goes up to 10^6
iterations, and not to 10^4 iterations as in Figure 2.) In “first equal”
(top-center), the immediate rewards differ from the base setup only
at the root, where, instead of rewarding the optimal action higher
than the sub-optimal actions, all the actions have the same reward
of 0.5, independently of the outcome. The results for the “good
likely” and “bad likely” variants of “first equal” are depicted in the
top-left and top-right corners of Figure 3. In the even more challenging
setup “first few equal” (bottom-center), the immediate rewards are
set to the minimum between 0.5 and the action-outcome value, that
is, R(s, a, s') = min{0.5, Q(s, a)}; in the “good likely”
(bottom-left) and “bad likely” (bottom-right) variants, the appropriate factor
is added to Q(s, a).

Figure 3. Experimental results on the “first equal” (top-center) and “first few equal” (bottom-center) modifications of the base setup, as well as on their
variants with “good likely” (←) and “bad likely” (→) transition functions.

Comparing the results for the variants of the base setup in Figure 2
with the results for “first equal” and “first few equal” in Figure 3, the
qualitative relative performance of DP, MCp, and MCm remains the
same, with the absolute performance of all algorithms decreasing, as
expected, from the base setup to “first equal”, and from “first equal”
to “first few equal”. It should also be noted that here, in contrast to
the base setup, the advantage of DP over MC under equiprobable
action outcomes was observed also when K and B were higher than
20. This is in line with the dependence of MC’s performance on
the correlation between the successive samples, because pushing the
discriminative rewards down the tree delays the correlation. Finally,
we also experimented with various graph-structured (in contrast to
tree-structured) variants of our MDP model. As expected, the
performance of all three algorithms improved with the degree of
multi-connectedness of the nodes, but the improvements were of the same
magnitude for all three algorithms. In sum, based on our experiments
and in line with the analysis in Section 3, DP appears more effective
than MC as long as the size of the problem is sufficiently small, but
otherwise, MC outperforms DP even under the most challenging
conditions, especially if the probability mass of the transition functions
concentrates on very few outcomes.
Abstract

Linking online planning for MDPs with their special
case of stochastic multi-armed bandit problems, we
analyze three state-of-the-art Monte-Carlo tree search
algorithms: UCT, BRUE, and MaxUCT. Using the
outcome, we (i) introduce two new MCTS algorithms,
MaxBRUE, which combines uniform sampling with
Bellman backups, and MpaUCT, which combines UCB1
with a novel backup procedure, (ii) analyze them
formally and empirically, and (iii) show how MCTS
algorithms can be further stratified by an exploration
control mechanism that improves their empirical
performance without harming the formal guarantees.

Introduction 

In online planning for MDPs, the agent focuses on its
current state only, deliberates about the set of possible
policies from that state onwards and, when interrupted,
chooses what action to perform next. In formal analysis
of algorithms for online MDP planning, the quality of
the action a, chosen for state s with H steps-to-go, is
assessed in terms of the simple regret measure,
capturing the performance loss that results from taking a and
then following an optimal policy π* for the remaining
H − 1 steps, instead of following π* from the
beginning (Bubeck and Munos 2010).
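
As a small illustration of the simple regret measure, the following Python sketch computes the regret of a chosen action from the exact action values at the current state. It assumes that these values, each covering the immediate action followed by an optimal policy for the remaining steps, are available in a dictionary; in online planning they are of course unknown, and the function name `simple_regret` is a hypothetical one used only here.

```python
from typing import Hashable, Mapping


def simple_regret(q_values: Mapping[Hashable, float],
                  chosen: Hashable) -> float:
    """Value of the best action minus the value of the chosen action."""
    return max(q_values.values()) - q_values[chosen]
```

The regret is zero exactly when the chosen action is one of the optimal actions at the state.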

Most algorithms for online MDP planning
constitute variants of what is called Monte-Carlo tree search
(MCTS) (Sutton and Barto 1998; Peret and Garcia
2004; Kocsis and Szepesvari 2006; Coquelin and Munos
2007; Cazenave 2009; Rosin 2011; Tolpin and Shimony
2012). When the MDP is specified declaratively,
that is, when all its parameters are provided
explicitly, the palette of algorithmic choices is wider (Bonet
and Geffner 2012; Kolobov, Mausam, and Weld 2012;
Busoniu and Munos 2012; Keller and Helmert 2013).
However, when only a generative model of the MDP is
available, that is, when the actions of the MDP are
given only by their “black box” simulators, MCTS
algorithms are basically the only choice. In MCTS,
agent deliberation is based on simulated sequential
sampling of the state space. MCTS algorithms have also
become popular in other settings of sequential
decision making, including those with partial state
observability and adversarial effects (Gelly and Silver 2011;
Sturtevant 2008; Bjarnason, Fern, and Tadepalli 2009;
Balla and Fern 2009; Eyerich, Keller, and Helmert 2010;
Browne et al. 2012).

The popularity of MCTS methods is due in part to
their ability to deal with generative
problem representations, but they have other desirable
features as well. First, while MCTS algorithms can
natively exploit problem-specific heuristic functions, their
correctness is independent of the heuristic’s properties,
and they can as well be applied without any heuristic
information whatsoever. Second, numerous MCTS
algorithms exhibit strong anytimeness: not only can a
meaningful action recommendation be provided at any
interruption point instantly, in time O(1), but the
quality of the recommendation also improves very smoothly,
in time steps that are independent of the size of the
explored state space.

Fundamental developments in the area of MCTS
algorithms can all be traced back to stochastic
multi-armed bandit (MAB) problems (Robbins 1952).
Here we take a closer look at three state-of-the-art
MCTS algorithms, UCT (Kocsis and Szepesvari
2006), BRUE (Feldman and Domshlak 2012; 2013), and
MaxUCT (Keller and Helmert 2013), linking them to
algorithms for online planning in MABs. This analysis
leads to certain interesting realizations about the
examined MCTS algorithms. Taking these realizations as
our point of departure, we:

•  Introduce two new MCTS algorithms, MaxBRUE,
which combines uniform sampling with Bellman
backups, and MpaUCT, which combines UCB1 with
a novel backup procedure;

•  Establish formal guarantees of exponential-rate
convergence for MaxBRUE (that turn out to be even
stronger than those known to be provided by BRUE),
and a hint about the polynomial-rate convergence of
MpaUCT;

•  Demonstrate empirically that, in line with the
empirical analysis of pure exploration in MAB (Bubeck
and Munos 2010), MaxBRUE performs better than
MpaUCT and MaxUCT under a permissive
planning-time allowance, while the opposite holds under short
planning times;

•  Show how MaxBRUE (and probably other
algorithms) can be stratified by an exploration control
mechanism that substantially improves the empirical
performance without harming the formal guarantees.

Background 

Henceforth, the operation of drawing a sample from a
distribution D over a set ℵ is denoted by ~ D[ℵ], U
denotes the uniform distribution, and [n] for n ∈ ℕ denotes
the set {1, ..., n}. For a sequence of tuples p, p[i]
denotes the i-th tuple along p, and p[i].x denotes the value
of the field x in that tuple.

Markov Decision Processes. MDP is a standard
model for planning under uncertainty (Puterman 1994).
An MDP (S, A, P, R) is defined by a set of states S, a set
of state-transforming actions A, a stochastic transition
function P : S × A × S → [0, 1], and a reward function
R : S × A × S → ℝ. The states are fully observable and,
in the finite horizon setting considered here, the rewards
are accumulated over some predefined number of steps
H. In what follows, s(h) denotes an MDP state s with
h steps-to-go, and A(s) ⊆ A denotes the actions
applicable in state s. The objective of planning in MDPs
is to sequentially choose actions so as to maximize the
accumulated reward. The representation of large-scale
MDPs can be either declarative or generative, but in
either case it is concise and allows for simulated execution of all
feasible action sequences, from any state of the MDP.
Henceforth, the action and state branching factors of
the MDP in question are denoted by K = max_s |A(s)|
and B = max_{s,a} |{s' | P(s' | s, a) > 0}|, respectively.

Simple Regret Minimization in MAB. A
stochastic multi-armed bandit (MAB) problem is an
MDP defined over a single state s. The actions in
MABs do not affect the state, but are associated with
stochastic rewards. Most research on MABs has been
devoted to the setup of reinforcement learning-while-acting,
where the cumulative regret is of interest and
exploration must be intertwined with exploitation. For
this setup, an action selection strategy called UCB1
was shown to attain the optimal logarithmic
cumulative regret by balancing the empirical attractiveness of
the actions with the potential of less sampled actions.
Specifically, UCB1 samples each action once, and then
iteratively selects actions as

\operatorname{argmax}_a \left[ \hat{\mu}_a + \alpha \sqrt{\frac{\log n}{n_a}} \right],

where n is the total number of samples so far, n_a is
the number of samples that went to action a, and \hat{\mu}_a is
the average reward of these samples of a. The
parameter α is an exploration factor that balances the two
components of the UCB1 formula.
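
The UCB1 selection rule described above can be sketched in a few lines of Python. The sketch assumes that the sample counters and the average rewards are kept in two parallel sequences indexed by action; the function name `ucb1_select` and this data layout are hypothetical choices made for illustration. Ties are broken in favor of the lowest action index, which the text does not specify.

```python
import math
from typing import Sequence


def ucb1_select(counts: Sequence[int], means: Sequence[float],
                alpha: float) -> int:
    """Return the index of the action that UCB1 samples next."""
    # First, sample each action once.
    for a, n_a in enumerate(counts):
        if n_a == 0:
            return a
    n = sum(counts)  # total number of samples so far
    scores = [mu + alpha * math.sqrt(math.log(n) / n_a)
              for mu, n_a in zip(means, counts)]
    # Select the action with the highest upper confidence value.
    return max(range(len(scores)), key=scores.__getitem__)
```

A larger exploration factor alpha shifts the selection toward the less sampled actions, while alpha = 0 makes the rule purely greedy with respect to the average rewards.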

Table 1: Upper bounds on the expected simple regret
of some online planning algorithms for MAB (Bubeck
and Munos 2010)

In contrast to learning-while-acting, in online
planning for MAB the agent is provided with a
simulator that can be used “free of charge” to evaluate the
alternative actions by drawing samples from their
reward distributions. An algorithm for online planning
for MAB is defined by an exploration strategy, used
to sample the actions, and a recommendation strategy,
used at the end of the planning to select an action that
is believed to minimize simple regret. Recently, Bubeck
et al. (2010) investigated worst-case convergence-rate
guarantees provided by various MAB planning
algorithms; their key findings are depicted in Table 1. Two
exploration strategies, the uniform one and a
generalization of UCB1, have been examined in the context
of “the empirical best action” (EBA) and “the most
played action” (MPA) recommendation strategies. The
table provides upper bounds on the expected simple
regret of the considered pairs of exploration (rows) and
recommendation (columns) strategies, where the O
symbols are distribution-dependent constants. Bubeck
et al. (2010) also examined these algorithms empirically,
and showed that, in line with the details of their
formal analysis, the UCB1-based exploration strategy
outperforms uniform+EBA under a moderate number of
samples, while the opposite holds under more
permissive exploration budgets.
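
The separation between an exploration strategy and a recommendation strategy can be illustrated by the following Python sketch. It pairs a uniform exploration, realized here as a round-robin over the actions (an assumption of this sketch), with the EBA and MPA recommendation strategies. The simulators are modeled as zero-argument callables returning a reward sample; all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def explore_uniform(simulators: Sequence[Callable[[], float]],
                    budget: int) -> Tuple[List[int], List[float]]:
    """Spend the sampling budget on the actions in round-robin order."""
    counts = [0] * len(simulators)
    totals = [0.0] * len(simulators)
    for i in range(budget):
        a = i % len(simulators)
        totals[a] += simulators[a]()  # draw one reward sample of action a
        counts[a] += 1
    return counts, totals


def recommend_eba(counts: Sequence[int], totals: Sequence[float]) -> int:
    """Empirical best action: the highest average sampled reward."""
    means = [t / c if c > 0 else float("-inf")
             for c, t in zip(counts, totals)]
    return max(range(len(means)), key=means.__getitem__)


def recommend_mpa(counts: Sequence[int]) -> int:
    """Most played action: the action that was sampled most often."""
    return max(range(len(counts)), key=counts.__getitem__)
```

Under uniform exploration the counters are nearly equal, so MPA carries little information; it becomes meaningful when paired with an exploration strategy, such as a UCB1-based one, that samples promising actions more often.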

Monte-Carlo Tree Search. MCTS, a canonical
scheme underlying various MCTS algorithms for online
MDP planning, is depicted in Figure 1. MCTS explores
the state space in the radius of H steps from the initial
state s_0 by iteratively issuing simulated rollouts from
s_0. Each such rollout p comprises a sequence of
simulated steps (s, a, r, s'), where s is a state, a is an action
applicable in s, r is an immediate reward collected from
issuing the action a, and s' is the resulting state. In
particular, p[0].s = s_0 and p[t].s' = p[t+1].s for all t.

Each generated rollout is used to update some
variables of interest. These variables typically include at
least the action value estimators Q(s(h), a), as well as
the counters n(s(h), a) that record the number of times
the corresponding estimators Q(s(h), a) have been
updated. Instances of MCTS vary mostly along the
different implementations of the strategies STOP-ROLLOUT,
specifying when to stop a rollout; SELECT-ACTION,
prescribing the action to apply in the current state of the
rollout; and UPDATE, specifying how a rollout should
update the maintained variables.

Once interrupted, MCTS uses the information collected throughout
the exploration to recommend an action to perform at state s0. The
rollout-based exploration of MCTS is especially appealing in the
setup of online planning because it allows smooth improvement


MCTS: [input: ⟨S, A, P, R⟩; s0 ∈ S]
  while time permits do
    ρ ← ROLLOUT    // generate rollout
    UPDATE(ρ)
  return argmax_a Q(s0(H), a)

procedure ROLLOUT
  ρ ← ⟨⟩; s ← s0; t ← 0
  while not STOP-ROLLOUT(ρ) do
    a ← SELECT-ACTION(s, t)
    s' ← SAMPLE-OUTCOME(s, a, t)
    r ← R(s, a, s')
    ρ[t] ← ⟨s, a, r, s'⟩
    s ← s'; t ← t + 1
  return ρ

Figure 1: Monte-Carlo tree search

of the intermediate quality of recommendation by propagating
to the root information from states at deeper levels in
iterations of low complexity of O(H).
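
The rollout scheme of Figure 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: an iteration cap replaces "while time permits", rollout steps are indexed from 0, the stopping rule is inlined, and the toy MDP, `uniform_select`, and `mc_update` are hypothetical names introduced here.

```python
import random
from collections import defaultdict

def uniform_select(s, h, acts, q, n, rng):
    """SELECT-ACTION: pick an applicable action uniformly at random."""
    return rng.choice(acts)

def mc_update(rollout, H, q, n):
    """UPDATE with Monte-Carlo backups: average the reward-to-go of each visited pair."""
    reward_to_go = 0.0
    for d in range(len(rollout) - 1, -1, -1):       # deepest step first
        s, a, r, _ = rollout[d]
        key = (s, H - d, a)                          # H - d steps still to go
        reward_to_go += r
        n[key] += 1
        q[key] += (reward_to_go - q[key]) / n[key]   # running average

def mcts(s0, H, actions, step, select, update, n_iter, rng):
    q, n = defaultdict(float), defaultdict(int)
    for _ in range(n_iter):                          # stands in for "while time permits"
        rollout, s = [], s0
        while len(rollout) < H and actions(s):       # STOP-ROLLOUT: horizon or dead end
            a = select(s, H - len(rollout), actions(s), q, n, rng)
            s_next, r = step(s, a, rng)              # SAMPLE-OUTCOME plus reward
            rollout.append((s, a, r, s_next))
            s = s_next
        update(rollout, H, q, n)
    return max(actions(s0), key=lambda a: q[(s0, H, a)])

# Toy MDP (hypothetical): moving "right" pays 1, moving "left" pays 0.
def toy_actions(s):
    return ["left", "right"]

def toy_step(s, a, rng):
    return s + 1, (1.0 if a == "right" else 0.0)

best = mcts(0, 2, toy_actions, toy_step, uniform_select, mc_update,
            200, random.Random(0))
```

Swapping `select` and `update` is all that is needed to obtain different MCTS instances, which is the sense in which the algorithms discussed in this paper are sets of sub-routines for the same scheme.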

MCTS algorithms: UCT, BRUE, and MaxUCT. UCT, one of the most
popular algorithms for online MDP planning to date, is depicted in
Figure 2 as a particular instantiation of MCTS. In UCT, the
rollouts end at terminal states, i.e., at depth H or at states
with no applicable actions.1 Each rollout updates all value
estimators Q(s(h), a) of the (s(h), a) pairs encountered along the
rollout. The estimators are updated via the MC-BACKUP procedure,
which averages the accumulated reward of the rollouts from s(h) to
terminal states.

Under  this  flow,  a  necessary  condition  for  the  value 
estimators  to  converge  to  their  true  values  is  that  the 
portion  of  samples  that  correspond  to  selections  of  op¬ 
timal  actions  must  tend  to  1  as  the  number  of  samples 
increases.  At  the  same  time,  in  order  to  increase  the 
confidence  that  the  optimal  actions  will  be  recognized 
at  the  nodes,  all  the  applicable  actions  must  be  sampled 
infinitely  often.  The  sampling  strategy  of  UCT,  UCB1, 
aims  at  achieving  precisely  that:  UCB1  ensures  that 
each  action  is  selected  at  least  a  logarithmic  number  of 
times,  and  that  suboptimal  actions  are  selected  at  most 
a  logarithmic  number  of  times;  thus,  the  proportion  of 
best-action  selections  indeed  tends  to  1. 
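
A minimal sketch of the UCB1 selection rule described above is given below. The dictionary-based interface and the default exploration constant `c=1.0` are assumptions made for illustration.

```python
import math

def ucb1_select(actions, q, n_state, n_action, c=1.0):
    """Pick an action by UCB1 at a single node.

    actions:  list of applicable actions
    q:        dict action -> current value estimate
    n_state:  number of times the node was visited
    n_action: dict action -> number of times the action was selected here
    c:        exploration constant (assumed default)
    """
    # Every action is tried once before the bonus term is used.
    for a in actions:
        if n_action.get(a, 0) == 0:
            return a
    # Value estimate plus an exploration bonus that shrinks with n_action[a].
    return max(
        actions,
        key=lambda a: q[a] + c * math.sqrt(math.log(n_state) / n_action[a]),
    )
```

With `c = 0` the rule degenerates to greedy selection; larger values of `c` shift samples toward rarely tried actions.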

UCT  has  many  success  stories  and  much  of  this  suc¬ 
cess  is  accounted  for  by  the  exploitative  property  of 
UCT.  This  property  results  in  skewing  towards  more 
attractive  actions  right  from  the  beginning  of  explo¬ 
ration,  a  protocol  that  presumably  enables  fast  homing 
on  “good”  actions.  However,  Table  1  shows  that  ex¬ 
ploitation  may  considerably  slow  down  the  reduction 
of  simple  regret  over  time.  Indeed,  much  like  UCB1 
for MABs, UCT achieves only polynomial-rate reduction of simple
regret over time (Bubeck, Munos, and Stoltz 2011), and the number
of samples after which the bounds of UCT on simple regret become
meaningful might be as high as hyper-exponential in H (Coquelin
and Munos 2007).

1 In a more popular version of UCT, a rollout ends at a newly
encountered node, but this is secondary to our discussion.

procedure UPDATE(ρ)
  r̄ ← 0
  for d ← |ρ|, ..., 1 do
    h ← H − d
    a ← ρ[d].a
    n(s(h)) ← n(s(h)) + 1
    n(s(h), a) ← n(s(h), a) + 1
    r̄ ← r̄ + ρ[d].r
    MC-BACKUP(s(h), a, r̄)

procedure MC-BACKUP(s(h), a, r̄)
  Q(s(h), a) ← ((n(s(h), a) − 1) / n(s(h), a)) · Q(s(h), a) + r̄ / n(s(h), a)

procedure STOP-ROLLOUT(ρ)
  t ← |ρ|
  return t = H or A(ρ[t].s') = ∅

procedure SELECT-ACTION(s, d)
  h ← H − d
  if ∃a : n(s(h), a) = 0 then
    return a
  return argmax_a [ Q(s(h), a) + c · sqrt( log n(s(h)) / n(s(h), a) ) ]

procedure SAMPLE-OUTCOME(s, a, t)
  return s' ∼ P(S | s, a)

Figure 2: UCT algorithm as a specific set of sub-routines for MCTS

Using  this  observation  and  following  the  findings  of 
Bubeck  et  al.  (Bubeck  and  Munos  2010),  in  our  ear¬ 
lier  work  we  introduced  the  concept  of  “separation  of 
concerns,”  whereby  the  first  part  of  each  rollout  is  de¬ 
voted  solely  to  the  purpose  of  selecting  particular  nodes, 
whereas  the  second  part  is  devoted  to  estimating  their 
value  (Feldman  and  Domshlak  2012).  We  showed  a  spe¬ 
cific  algorithm,  BRUE,  that  implements  this  concept  by 
always  updating  value  estimators  with  samples  that  ac¬ 
tivate  currently  best  actions  only,  but  the  estimators  to 
be  updated  are  chosen  by  rolling  out  actions  uniformly 
at random. It turns out that, in contrast to UCT, BRUE
achieves an exponential-rate reduction of simple regret
over time, with the bounds on simple regret becoming
meaningful after a number of samples that is only
exponential in H^2. Moreover, BRUE was also shown to be very
effective in practice.

Finally,  BRUE  was  not  the  only  successful  attempt 
to  improve  over  UCT.  In  particular,  Keller  &  Helmert 
(2013)  recently  introduced  MaxUCT,  a  modification  of 
UCT  in  which  MC  backups  are  replaced  with  Bellman 
backups  using  approximate  transition  probabilities,  and 
demonstrated  that  MaxUCT  substantially  outperforms 


            EBA              MPA       UCB1
Uniform     BRUE, MaxBRUE    -         -
UCB1        MaxUCT           MpaUCT    UCT

Table 2: MCTS algorithms for MDPs through the lens of the MAB
exploration topology. Rows are exploration strategies, and columns
are recommendation strategies.


procedure UPDATE(ρ)
  for d ← |ρ|, ..., 1 do
    h ← H − d
    a ← ρ[d].a
    s' ← ρ[d].s'
    n(s(h)) ← n(s(h)) + 1
    n(s(h), a) ← n(s(h), a) + 1
    n(s(h), a, s') ← n(s(h), a, s') + 1
    R(s(h), a) ← R(s(h), a) + ρ[d].r
    MPA-BACKUP(s(h), a)


UCT  empirically.  In  terms  of  formal  guarantees,  how¬ 
ever,  there  is  no  dramatic  difference  between  the  con¬ 
vergence  rates  of  the  two  algorithms. 

From  MAB  to  MDP 

To relate online planning for MABs to online planning for more
general MDPs, we begin by drawing ties between

(i)  MCTS  rollout  sampling  strategies  and  arm  explo¬ 
ration  strategies  in  MAB,  and 

(ii)  MCTS  selection  of  actions  used  to  update  search 
nodes  and  arm  recommendation  strategies  in 
MAB. 

Considering  the  three  algorithms  discussed  above  in 
that  perspective,  the  picture  appears  to  be  as  follows. 

•  UCT  combines  MC  backups  with  a  rollout  sampling 
driven  by  the  UCB1  action-selection  strategy.  Inter¬ 
estingly,  there  is  no  perfect  analogy  between  UCT  and 
a  reasonable  algorithm  for  pure  exploration  in  MAB. 
This  is  because,  at  all  nodes  but  the  root,  UCB1  de 
facto  drives  both  UCT’s  rollout  sampling  and  node  up¬ 
dates,  yet  recommending  an  arm  in  MAB  according 
to  UCB1  does  not  have  well  justified  semantics. 

•  BRUE  is  analogous  to  uniform  exploration  with  em¬ 
pirical  best  action  recommendation:  Applying  the 
principle  of  separation  of  concerns,  nodes  are  reached 
by  selecting  actions  uniformly,  and  the  samples  used 
in  the  MC  backups  are  generated  by  selecting  the 
empirical  best  actions. 

•  MaxUCT is analogous to UCB(α) exploration with
best empirical action recommendation: With Bellman
backups, the updated value corresponds to the
value of the action with the best empirical value.
Interestingly,  this  perspective  reveals  that  switching 
from  MC  backups  to  Bellman  backups  in  MaxUCT 
essentially  constitutes  another  way  to  separate  con¬ 
cerns  in  the  sense  discussed  above. 


procedure MPA-BACKUP(s(h), a)
  Q(s(h), a) ← R(s(h), a) / n(s(h), a)
  v ← 0
  for s' ∈ {s' | n(s(h), a, s') > 0} do
    A ← argmax_{a'} n(s'(h − 1), a')
    a* ← argmax_{a' ∈ A} Q(s'(h − 1), a')
    v ← v + (n(s(h), a, s') / n(s(h), a)) · Q(s'(h − 1), a*)
  Q(s(h), a) ← Q(s(h), a) + v

Figure 3: MpaUCT as UCT with a modified UPDATE procedure
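
To make the difference between the two backups concrete, here is a hedged Python sketch of a one-step backup for a single state-action pair. It assumes the value estimate is the average immediate reward plus the empirically weighted successor values, with a Bellman backup taking the best successor estimate and an MPA backup taking the estimate of the most played successor action (ties broken by value). The dictionary layout and the function name `backup` are illustrative assumptions.

```python
def backup(reward_sum, n_sa, n_sas, child_q, child_n, mode="bellman"):
    """One-step backup for a single (state, action) pair.

    reward_sum: accumulated immediate reward of the pair
    n_sa:       number of times the pair was updated
    n_sas:      dict successor -> number of observed transitions to it
    child_q:    dict successor -> {action: value estimate at the successor}
    child_n:    dict successor -> {action: update count at the successor}
    mode:       "bellman" (empirically best action) or "mpa" (most played action)
    """
    value = reward_sum / n_sa                      # average immediate reward
    for s_next, count in n_sas.items():
        qs, ns = child_q[s_next], child_n[s_next]
        if count == 0 or not qs:                   # unseen or terminal successor
            continue
        if mode == "bellman":
            folded = max(qs.values())
        else:
            top = max(ns.values())                 # most played; ties broken by value
            folded = max(qs[a] for a in qs if ns[a] == top)
        value += count / n_sa * folded             # empirical transition probability
    return value
```

The two modes differ only in which successor estimate is folded up, which mirrors the EBA and MPA recommendation strategies for MABs.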


Building on this link between online planning for MAB and
general MDPs, in what follows we present and analyze two new
MCTS algorithms for MDP planning. The union of the known and
new algorithms is depicted in Table 2, with the names of the
new algorithms underscored.

The first algorithm, MpaUCT, is analogous to UCB(α) exploration
with “most played action” recommendations in MAB. This algorithm
is inspired by the findings of Bubeck et al. (Bubeck, Munos, and
Stoltz 2011) that, unlike all other bounds shown in Table 1, the
convergence rate of UCB(α)+MPA planning on MAB can be bounded
independently of the problem parameters.

A simple adaptation of the MPA recommendation strategy to MDPs is
a modification of the Bellman backup: Instead of folding up the
value of the empirically best action, we propagate the value of
the action that was updated the most. Ties are broken in favor of
actions with better empirical value. The resulting algorithm is
depicted in Figure 3, and later on we present our empirical
findings with it.

The second algorithm, MaxBRUE, is, like BRUE, analogous to uniform
exploration with EBA recommendations, but it employs Bellman
backups rather than MC backups. MaxBRUE is depicted in Figure 4.
As we show in the proof of Theorem 1 below, not only does MaxBRUE
achieve exponential-rate reduction of simple regret similarly to
BRUE, but the particular parameters of the convergence bounds are
more attractive than those currently known for BRUE. Basically,
Theorem 1 positions MaxBRUE as the worst-case most efficient MCTS
algorithm for online MDP planning to date.


Theorem 1 Let MaxBRUE be called on a state s0 of an MDP
⟨S, A, P, R⟩ with rewards in [0,1] and finite horizon H. After
n ≥ 1 iterations of MaxBRUE, we have the probability $p_{err}$ of
sub-optimal action choice being bounded as
$p_{err} \le \alpha e^{-\beta n}$, and the expected simple regret
$\Delta$ being bounded as $\Delta \le H \alpha e^{-\beta n}$, where
$\alpha = 3K(3BK)^H$, $\beta = \frac{d^2}{4K(4BK)^H H^2}$, and $d$
is the simple regret of the second-best action at s0(H).
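
To get a feeling for the magnitudes involved, the bound can be evaluated numerically. The sketch below reads the constants of Theorem 1 as α = 3K(3BK)^H and β = d²/(4K(4BK)^H H²), with d standing for the simple regret of the second-best action at the root; this reading of the constants and the function names are assumptions of the sketch.

```python
import math

def maxbrue_bound_constants(K, B, H, d):
    """Constants (alpha, beta) of the bound p_err <= alpha * exp(-beta * n)."""
    alpha = 3 * K * (3 * B * K) ** H
    beta = d ** 2 / (4 * K * (4 * B * K) ** H * H ** 2)
    return alpha, beta

def iterations_for_error(K, B, H, d, eps):
    """Smallest n >= 1 for which alpha * exp(-beta * n) <= eps."""
    alpha, beta = maxbrue_bound_constants(K, B, H, d)
    return max(1, math.ceil(math.log(alpha / eps) / beta))

def regret_bound(K, B, H, d, n):
    """Upper bound H * alpha * exp(-beta * n) on the expected simple regret."""
    alpha, beta = maxbrue_bound_constants(K, B, H, d)
    return H * alpha * math.exp(-beta * n)
```

Because β shrinks exponentially with H, the number of iterations after which the bound drops below 1 grows exponentially in the horizon, even though the decay in n is exponential.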


procedure UPDATE(ρ)
  for d ← |ρ|, ..., 1 do
    h ← H − d
    a ← ρ[d].a
    s' ← ρ[d].s'
    n(s(h)) ← n(s(h)) + 1
    n(s(h), a) ← n(s(h), a) + 1
    n(s(h), a, s') ← n(s(h), a, s') + 1
    R(s(h), a) ← R(s(h), a) + ρ[d].r
    BELLMAN-BACKUP(s(h), a)

procedure BELLMAN-BACKUP(s(h), a)
  Q(s(h), a) ← R(s(h), a) / n(s(h), a)
  v ← 0
  for s' ∈ {s' | n(s(h), a, s') > 0} do
    v ← v + (n(s(h), a, s') / n(s(h), a)) · max_{a'} Q(s'(h − 1), a')
  Q(s(h), a) ← Q(s(h), a) + v

procedure STOP-ROLLOUT(ρ)
  t ← |ρ|
  return t = H or A(ρ[t].s') = ∅

procedure SELECT-ACTION(s, t)    // uniform
  return a ∼ U[A(s)]

procedure SAMPLE-OUTCOME(s, a, t)
  return s' ∼ P(S | s, a)

Figure 4: MaxBRUE algorithm as a specific set of sub-routines
for MCTS


Proof: A key sub-claim we prove first is that, at any iteration
of the algorithm, for all $h \in [H]$, all states $s$ reachable
from $s_0$ in $H - h$ steps, all actions $a \in A(s)$, and any
$\delta > 0$, it holds that

\[
\mathbb{P}\left\{ \left| \hat{Q}(s(h),a) - Q(s(h),a) \right| \ge \delta \right\}
\le 2(3KB)^h e^{-\frac{2\delta^2 n(s(h),a)}{(4BK)^h h^2}}, \tag{1}
\]

writing $\hat{Q}$ for the estimator that the figures denote by Q,
and $Q$ for the true action value.

The proof of this sub-claim is by induction on h. Starting with
h = 1, by Hoeffding concentration inequality, we have that

\[
\mathbb{P}\left\{ \left| \hat{Q}(s(1),a) - Q(s(1),a) \right| \ge \delta \right\}
\le 2 e^{-2\delta^2 n(s(1),a)}.
\]

Now, assuming Eq. 1 holds for $h' \le h$, we prove it holds for
$h + 1$. From the induction hypothesis, we have

\begin{align*}
\mathbb{P}\left\{ \left| \hat{Q}(s(h),a) - Q(s(h),a) \right| \ge \delta \right\}
&\le \mathbb{P}\left\{ n(s(h),a) \le \frac{n(s(h))}{2K} \right\}
 + \mathbb{P}\left\{ \left| \hat{Q}(s(h),a) - Q(s(h),a) \right| \ge \delta \;\middle|\; n(s(h),a) > \frac{n(s(h))}{2K} \right\} \\
&\le e^{-\frac{n(s(h))}{2K^2}} + 2(3BK)^h e^{-\frac{2\delta^2 n(s(h))}{2K(4BK)^h h^2}} \\
&\le \left( 1 + 2(3BK)^h \right) e^{-\frac{\delta^2 n(s(h))}{K(4BK)^h h^2}},
\end{align*}

which implies

\[
\mathbb{P}\left\{ \left| \max_a \hat{Q}(s(h),a) - Q(s(h),\pi^*(s)) \right| \ge \delta \right\}
\le K \left( 1 + 2(3BK)^h \right) e^{-\frac{\delta^2 n(s(h))}{K(4BK)^h h^2}}. \tag{2}
\]

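Denoting $\hat{p}_{s,a,s'} = \frac{n(s(h+1),a,s')}{n(s(h+1),a)}$ and
$p_{s,a,s'} = P(s' \mid s,a)$, it holds that

\begin{align*}
\left| \hat{Q}(s(h+1),a) - Q(s(h+1),a) \right|
&= \Big| \sum_{s'} \hat{p}_{s,a,s'} \big[ R(s,a,s') + \max_{a'} \hat{Q}(s'(h),a') \big]
   - \sum_{s'} P(s' \mid s,a) \big[ R(s,a,s') + Q(s'(h),\pi^*(s')) \big] \Big| \\
&\le \Big| \sum_{s'} \big( \hat{p}_{s,a,s'} - P(s' \mid s,a) \big) \big[ R(s,a,s') + Q(s'(h),\pi^*(s')) \big] \Big| \\
&\quad + \Big| \sum_{s'} \hat{p}_{s,a,s'} \big[ \max_{a'} \hat{Q}(s'(h),a') - Q(s'(h),\pi^*(s')) \big] \Big|,
\end{align*}

and thus

\begin{align*}
\mathbb{P}\left\{ \left| \hat{Q}(s(h+1),a) - Q(s(h+1),a) \right| \ge \delta \right\}
&\le \mathbb{P}\Big\{ \Big| \sum_{s'} \big( \hat{p}_{s,a,s'} - P(s' \mid s,a) \big) \big[ R(s,a,s') + Q(s'(h),\pi^*(s')) \big] \Big| \ge \frac{\delta}{2} \Big\} \\
&\quad + \mathbb{P}\Big\{ \sum_{s'} \hat{p}_{s,a,s'} \Big| \max_{a'} \hat{Q}(s'(h),a') - Q(s'(h),\pi^*(s')) \Big| \ge \frac{\delta}{2} \Big\} \\
&\le 2 e^{-\frac{\delta^2 n(s(h+1),a)}{2(h+1)^2}}
 + \sum_{s'} \mathbb{P}\Big\{ \Big| \max_{a'} \hat{Q}(s'(h),a') - Q(s'(h),\pi^*(s')) \Big| \ge \frac{\delta}{2\sqrt{B \hat{p}_{s,a,s'}}} \Big\}. \tag{3}
\end{align*}

In the last bounding of Eq. 3, the first term is due to Hoeffding
and the second term is justified by noting that the solution to
the problem

\[
\max_{p} \sum_{i=1}^{B} \sqrt{p_i} \quad \text{subject to} \quad \sum_{i=1}^{B} p_i = 1
\]

is $p = \left( \frac{1}{B}, \dots, \frac{1}{B} \right)$, with value
$\sum_{i=1}^{B} \sqrt{1/B} = \sqrt{B}$. From Eq. 2 we have

\begin{align*}
\sum_{s'} \mathbb{P}\Big\{ \Big| \max_{a'} \hat{Q}(s'(h),a') - Q(s'(h),\pi^*(s')) \Big| \ge \frac{\delta}{2\sqrt{B \hat{p}_{s,a,s'}}} \Big\}
&\le \sum_{s'} K \left( 1 + 2(3BK)^h \right) e^{-\frac{\delta^2 n(s'(h))}{4 B \hat{p}_{s,a,s'} K (4BK)^h h^2}} \\
&\le BK \left( 1 + 2(3BK)^h \right) e^{-\frac{\delta^2 n(s(h+1),a)}{(4BK)^{h+1} h^2}}. \tag{4}
\end{align*}

Returning now to Eq. 3, we have

\begin{align*}
\mathbb{P}\left\{ \left| \hat{Q}(s(h+1),a) - Q(s(h+1),a) \right| \ge \delta \right\}
&\le 2 e^{-\frac{\delta^2 n(s(h+1),a)}{2(h+1)^2}}
 + BK \left( 1 + 2(3BK)^h \right) e^{-\frac{\delta^2 n(s(h+1),a)}{(4BK)^{h+1} h^2}} \\
&\le (3BK)^{h+1} e^{-\frac{\delta^2 n(s(h+1),a)}{(4BK)^{h+1} (h+1)^2}}, \tag{5}
\end{align*}

finalizing the proof of the induction step. Now, given the
sub-claim around Eq. 1 and denoting
$d_a = \left| \hat{Q}(s_0(H),a) - Q(s_0(H),a) \right|$, the first part of
Theorem 1 is concluded by

\begin{align*}
p_{err} &\le \sum_a \mathbb{P}\left\{ d_a \ge \frac{d}{2} \right\} \\
&\le \sum_a \mathbb{P}\left\{ n(s_0(H),a) \le \frac{n(s_0(H))}{2K} \right\}
 + \sum_a \mathbb{P}\left\{ d_a \ge \frac{d}{2} \;\middle|\; n(s_0(H),a) > \frac{n(s_0(H))}{2K} \right\} \tag{6} \\
&\le K \left( e^{-\frac{n}{2K^2}} + 2(3BK)^H e^{-\frac{d^2 n}{4K(4BK)^H H^2}} \right) \\
&\le 3K(3BK)^H e^{-\frac{d^2 n}{4K(4BK)^H H^2}}.
\end{align*}

The second part of Theorem 1 is obtained from the fact that the
maximal loss from choosing a sub-optimal action at s0(H) is H. ■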
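
The proof relies on the fact that, under uniform action selection, an action at a node visited n times is unlikely to receive fewer than n/(2K) of the samples; Hoeffding's inequality bounds this probability by e^{-n/(2K^2)}. The following sketch compares the exact binomial tail with that bound. It assumes each visit selects the action independently with probability 1/K; the function names are hypothetical.

```python
from math import comb, exp

def undersampling_probability(n, K):
    """Exact P{ count <= n / (2K) } for count ~ Binomial(n, 1/K)."""
    p = 1.0 / K
    limit = n // (2 * K)        # largest integer count not exceeding n / (2K)
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(limit + 1))

def hoeffding_bound(n, K):
    """Hoeffding upper bound exp(-n / (2 K^2)) on the same probability."""
    return exp(-n / (2 * K * K))
```

The gap between the exact tail and the bound illustrates why the guarantees of the theorem are conservative: the Hoeffding step alone can overestimate the failure probability by a large factor.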


MaxBRUE+:  Controlled  Exploration 

At any non-terminal node s(h), the inaccuracy of the Q-value
estimators in Bellman-based MaxUCT and MaxBRUE stems from two
sources: the inaccuracy in the estimation of the action
transition probabilities, and the inaccuracy of the value
estimators that are folded up to s(h) from its immediate
successors. The former is reduced by sampling actions at s(h),
and the latter is reduced by sampling the successors of s(h).

If  we  wish  to  optimize  the  formal  guarantees  on  the 
convergence  rate,  it  is  ideal  to  have  these  two  sources  of 
inaccuracy  balanced.  This  can  be  achieved  by  equating 
the  number  of  samples  of  the  node  with  the  number 
of  samples  of  its  immediate  successors.  In  fact,  this  is 
precisely  what  is  done  by  the  seminal  sparse  sampling 


procedure  STOP-ROLLOUT(p) 
d  \p\ 
h  4—  H  —  d 
s  p[d\.s 
a  -S—  p[d].a 
s'  p[d\.s' 

S(s,a)  <r-  {s'  |  n(s(h),a,s')  >  0} 
if  n  (s'(h  —  1))  >  K  ■  |S(s,  o)|  •  n  ( s(h),a ,  s')  then 
return  true 

return  (d  =  H  or  A(p[d].s')  =  0) 

Figure  5:  MaxBRUE+  as  MaxBRUE  with  a  modified 
STOP-PROBE  procedure 

algorithm  of  Kerns  et  al.  (2002)  for  PAC  (probably 
approximately  correct)  MDP  planning.  However,  the 
PAC  setting  does  not  require  smooth  improvement  of 
the  quality  of  recommendation  over  time,  and  thus  the 
nodes  can  be  sampled  in  a  systematic  manner.  In  con¬ 
trast,  in  the  online  setup,  smooth  convergence  is  crit¬ 
ical,  and  rollout-based  exploration  seems  to  serve  this 
purpose  quite  effectively.  At  the  same  time,  rollout- 
based  exploration  of  MCTS  leads  to  unbalanced  node 
sampling. 

While  we  might  see  unbalanced  node  sampling  as  an 
inevitable  cost  for  a  justified  cause,  this  is  not  entirely 
so.  The  overall  convergence  rate  with  rollout-based  ex¬ 
ploration  is  dictated  by  the  accuracy  of  the  estimates 
at  the  nodes  that  are  farther  towards  the  horizon  from 
s0(H).  Due  to  branching,  these  nodes  are  expected  to 
be  sampled  less  frequently.  However,  since  the  search 
spaces  induced  by  MDPs  most  typically  form  DAGs 
(and  not  just  trees),  a  node  with  multiple  ancestors 
might possibly be sampled more than each of its ancestors.
In terms of the formal guarantees, the latter type of
imbalance is worthless, and samples are better diverted
to nodes that are sampled less than their immediate
ancestors.

A  rather  natural  way  to  control  this  type  of  imbal¬ 
ance  is  by  modifying  the  protocol  for  stopping  a  rollout. 
Given the last sample ⟨s, a, r, s'⟩ along the ongoing rollout,
if the number of samples n(s(h), a) is smaller than the number of
samples n(s'(h − 1), a') of some action a' ∈ A(s') applicable at
the resulting state s', the rollout is stopped. Supported by the
formal analysis of MaxBRUE in the proof of Theorem 1, we slightly
modify the latter condition as follows:

•  We replace the requirement depicted above with a weaker one
whereby the overall number of updates n(s'(h − 1)) of the
resulting state s' is at least K times larger than n(s(h), a).

•  We multiply the counter n(s(h), a) by B · P(s' | s, a).
If P(S | s, a) induces a uniform distribution over the plausible
outcomes of a at s, this modification changes nothing. At the
same time, for more/less probable outcomes s', this modification
implies a stronger/weaker condition, respectively.

The substitution of the sub-procedure STOP-ROLLOUT of MaxBRUE
with that depicted in Figure 5 constitutes the stratified
algorithm MaxBRUE+.

[Figure 6 plots: two panels, for the 20 x 20 and 40 x 40 grids,
each with Time (sec.) on the horizontal axis]

Figure 6: Simple regret reduction over time in the Sailing
domain by different MCTS algorithms

Note that, in the proof of Theorem 1, we make use of the fact
that n(s'(h − 1)) ≥ n(s(h), a, s') to replace the former with the
latter in the bounding in Eq. 4. Therefore, enforcing the more
conservative n(s'(h − 1)) ≤ K · |S(s, a)| · n(s(h), a, s') does
not affect the bound. At the same time, it is easy to see that
(i) the two sources of inaccuracy are balanced when
n(s'(h − 1)) = K · |S(s, a)| · n(s(h), a, s'), and (ii) beyond
this point, the node accuracy is surpassed by the accuracy of
its children.
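
The stopping rule of Figure 5 can be sketched as a small predicate. The argument names and the dictionary of outcome counts are assumptions made for this illustration; the rule itself follows the condition n(s'(h − 1)) > K · |S(s, a)| · n(s(h), a, s') given in the figure.

```python
def stop_rollout_plus(depth, H, s_next, n_next_state, n_sas, K, next_has_actions):
    """Controlled-exploration stopping rule, evaluated after the last rollout step.

    depth:            number of steps in the rollout so far
    H:                planning horizon
    s_next:           state reached by the last step
    n_next_state:     number of updates of s_next so far
    n_sas:            dict outcome -> transition count for the last (state, action)
    K:                number of actions per state
    next_has_actions: whether any action is applicable in s_next
    """
    observed = [o for o, count in n_sas.items() if count > 0]   # S(s, a)
    if n_next_state > K * len(observed) * n_sas[s_next]:
        return True            # successor is already sampled more than its parent warrants
    return depth == H or not next_has_actions
```

Stopping early in this way leaves the remaining budget for rollouts that reach under-sampled nodes, which is the intent of the controlled exploration.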

Experimental  Evaluation 

Our empirical study comprises two sets of experiments, comparing
four algorithms: MaxUCT (Keller and Helmert 2013), MpaUCT,
MaxBRUE, and MaxBRUE+. In the first set, we evaluate the four
algorithms in terms of their reduction of simple regret in the
Sailing domain (Peret and Garcia 2004). In this domain, a
sailboat navigates to a goal location on an 8-connected grid,
under fluctuating wind conditions. At a high level, the goal is
to reach a concrete destination as quickly as possible, by
choosing at each grid location a neighbor location to move to.
The duration of each such move depends on the direction of the
move (ceteris paribus, diagonal moves take √2 times as long as
straight moves), the direction of the wind relative to the
sailing direction (the sailboat cannot sail against the wind and
moves fastest with a tail wind), and the tack. In Figure 6, we
plot the empirical simple regret for increasing deliberation
times for two grid sizes, 20 x 20 and 40 x 40, averaged over 2000
runs with varying origin and goal locations. It is interesting to
see that MaxBRUE clearly dominates MaxUCT (and MpaUCT) right from
the beginning, unlike the case of BRUE and UCT, whereby UCT
performs better until some point.2 Figure 6 also demonstrates the
benefit of the exploration control of MaxBRUE+, as well as the
benefit of using the MPA-backup of MpaUCT.

The second set of experiments compares the empirical reward
collected by the four algorithms on five IPPC-2011 domains:
Game-of-Life, SysAdmin, Traffic, Crossing, and Navigation.
(Almost all of the tasks in these domains are simply too large
for us to examine simple regret.) From each domain, we chose four
tasks, I1, I3, I5, I10, with higher indexes corresponding to
tasks of higher complexity. Each algorithm was given a
deliberation budget that decreased linearly from 10 seconds in
the first step to 1 second in the last step. For each domain,
Figure 7(a) depicts the general bound on the state branching
factor K and the action branching factor B, as well as the
specific horizon H used in the experiments, all as functions of a
domain-specific parameter p that scales linearly with the
instance index. (H > 10 was used in goal-oriented domains.)
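
The linearly decreasing deliberation budget described above can be written down directly. The sketch assumes plain linear interpolation between the first and the last step, with steps indexed from 0; the function name is hypothetical.

```python
def deliberation_budget(step, n_steps, first=10.0, last=1.0):
    """Seconds of deliberation at a given step (0-based) of an n_steps-long episode."""
    if n_steps == 1:
        return first
    return first + (last - first) * step / (n_steps - 1)
```

For a ten-step episode this yields 10, 9, ..., 1 seconds, so later decisions, which face a shorter remaining horizon, receive proportionally less time.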

2 For simple regret analysis on the Sailing domain with longer
deliberation times, we refer the reader to Feldman & Domshlak
(2013).

3 At this stage, the dramatically superior performance of
MaxBRUE+ on instance I10 of the Navigation domain should be
considered a positive anomaly, and not given any deep
generalizing explanations.

Figure 7: (a) Structural and experimental parameters of the
Sailing and IPPC-2011 domains, and (b-f) scores for the different
MCTS algorithms as their normalized improvement over a trivial
baseline

Figures 7(b-f) plot the score of the four algorithms based on
700 samples, normalized between 0 and 1 as their average
improvement over a trivial baseline of random action selection.
With the exception of the SysAdmin domain3, it appears that the
results are quite similar for all algorithms. However, the
relative performance differences seem to comply with the analysis
of Bubeck et al. (2010) for online planning in MABs.
Specifically, Bubeck et al. (2010) observed that, despite the
superior convergence rate of uniform sampling in general, the
bounds provided by the UCB(α)-based exploration can be more
attractive if the sample allowance is not permissive enough with
respect to the number of actions K. In case of more general MDPs,
the “structural complexity” of the problem is determined not only
by K, but also by the action branching factor B and the horizon
H. In that respect, considering the “structural complexity” of
the domains depicted in Figure 7(a), the results in Figures
7(b-f) are in line with the relative pros and cons of the uniform
and UCB(α) explorations.

•  In  Sailing,  Navigation,  and  Crossing,  both  K  and 
B  grow  reasonably  slowly  with  the  size  of  the  prob¬ 
lem.  This  makes  our  fixed  time  budget  reasonably 
permissive  (and  thus  gives  the  advantage  to  uniform 
exploration)  across  the  instances. 

•  In the domains of intermediate structural complexity,
Game-of-Life and SysAdmin, the same time budget appears to be
reasonably permissive on the smaller instances, giving an
advantage to uniform exploration, but then the instances grow
rather fast, giving an advantage to the UCB1-based algorithms.

•  In Traffic, both K and B grow exponentially fast with
the size of the problem, making our fixed time budget too
small for the uniform exploration to shine even on the
smallest instance.

Summary 

By drawing ties between online planning in MDPs and MABs, we have
shown that the state-of-the-art MC planning algorithms UCT, BRUE
and MaxUCT, as well as the two newly introduced MaxBRUE and
MpaUCT, borrow the theoretical and empirical properties of their
MAB counterparts. In particular, we have proven that the
exponential convergence of uniform exploration with
recommendation of the empirically best action in MABs applies
also to MaxBRUE, resulting in the best known convergence rates
among online MCTS algorithms to date. Moreover, in line with MAB
results, the superiority of MaxBRUE with a permissive budget as
well as the superiority of the UCB(α)-based exploration
algorithms MaxUCT and MpaUCT with a moderate budget has been
demonstrated empirically. We have also shown that a particular
exploration control mechanism applied to MaxBRUE substantially
improves its performance. We believe that this mechanism and
variations of it can be valuable to other online planning
algorithms as well. Finally, other exploration strategies that
are found appealing in the context of MABs can also be
“converted” to MDPs following the lines of this work.
Landmarks in Oversubscription Planning

Vitaly  Mirkis  and  Carmel  Domshlak  1 


Abstract. In the basic setup of oversubscription planning (OSP),
the objective is to achieve an as valuable as possible subset of goals
within a fixed allowance of the total action cost [32]. Continuing
from  the  recent  successes  in  exploiting  logical  goal-reachability  land¬ 
marks  in  classical  planning,  we  develop  a  framework  for  exploit¬ 
ing  such  landmarks  in  heuristic-search  OSP.  We  show  how  standard 
landmarks  of  certain  classical  planning  tasks  can  be  compiled  into 
the  OSP  task  of  interest,  resulting  in  an  equivalent  OSP  task  with 
a  lower  budget,  and  thus  with  a  smaller  search  space.  We  then  show 
how  such  landmark-based  task  enrichment  can  be  combined  in  a  mu¬ 
tually  stratifying  way  with  the  BFBB  search  used  for  OSP  planning. 
Our  empirical  evaluation  confirms  the  effectiveness  of  the  proposed 
landmark-based  budget  reduction  scheme. 

1  INTRODUCTION 

In  most  general  terms,  deterministic  planning  is  a  problem  of  find¬ 
ing  paths  in  large-scale  yet  concisely  represented  state-transition  sys¬ 
tems.  In  what  these  days  is  called  classical  planning  [11],  the  task  is 
to  find  an  as  cost-effective  path  as  possible  to  a  goal-satisfying  state. 
In contrast, in what Smith [32] baptized as “oversubscription”
planning (OSP), the task is to find an as goal-effective (or valuable)
state as possible via a cost-satisfying path. In other words, the hard
constraint of classical planning translates to only preference in OSP,
and the hard constraint of OSP translates to only preference in
classical planning. Finally, in “optimal” classical planning and OSP, the tasks
are  further  constrained  to  finding  only  most  cost-effective  paths  and 
most  goal-effective  states,  respectively. 

Classical  planning  and  OSP  constitute  the  most  fundamental  vari¬ 
ants  of  deterministic  planning,  with  many  other  variants  of  deter¬ 
ministic  planning  being  defined  in  terms  of  mixing  and  relaxing  the 
two.  For  instance,  “net-benefit”  planning  tries  to  achieve  both  (classi¬ 
cal)  cost-effectiveness  of  the  path  and  (OSP)  goal-effectiveness  of  the 
end-state  by  additively  combining  the  two  measures,  but  at  the  same 
time,  it  relaxes  the  hard  constraints  of  (classical)  goal-satisfaction 
and (OSP) cost-satisfaction [31, 1, 4, 2, 7, 24]. Another popular
setup is “cost-bounded” planning, in which both (classical)
goal-satisfaction and (OSP) cost-satisfaction are pursued, but both
(classical) cost-effectiveness of the path and (OSP) goal-effectiveness
of the end-state are relaxed/ignored [33, 34, 13, 15, 19, 12, 27].

While  OSP  has  been  advocated  over  the  years  on  par  with  classi¬ 
cal  planning,  so  far,  the  theory  and  practice  of  the  latter  have  been 
studied  and  advanced  much  more  intensively.  The  remarkable  suc¬ 
cess  and  continuing  progress  of  heuristic-search  solvers  for  classical 
planning  is  one  notable  example.  Primary  enablers  of  this  success 
are  the  advances  in  domain-independent  approximations,  or  heuris¬ 
tics,  of  the  cost  needed  to  achieve  a  goal  state  from  a  given  state. 

1 Technion, Haifa, Israel, emails: {mirkis@tx,dcarmel@ie}.technion.ac.il


With  our  focus  here  on  optimal  planning,  two  classes  of  approxi¬ 
mation  techniques  have  been  found  especially  useful  in  the  context 
of optimal classical planning: those based on state-space
abstractions [10, 14, 18, 22] and those based on logical landmarks for
goal reachability [21, 17, 9, 5, 28].

Considering  OSP  as  heuristic  search,  a  question  is  then  whether 
some  similar-in-spirit  (yet  possibly  mathematically  different)  ap¬ 
proximation  techniques  can  be  developed  for  heuristic-search  OSP. 
Recently,  the  authors  provided  the  first  affirmative  answer  to  this 
question  in  the  context  of  abstractions  by  developing  the  actual  no¬ 
tion  of  OSP  abstractions,  investigating  the  complexity  of  working 
with  them  for  the  purpose  of  heuristic  approximation,  and  demon¬ 
strating  empirically  that  using  OSP  abstraction  heuristics  within  a 
best-first branch-and-bound (BFBB) search can be extremely effective
in practice [25]. In contrast, the prospects of goal-reachability
landmarks in heuristic-search OSP have not been investigated yet.

This is precisely the contribution of this paper: First, we
introduce and study ε-landmarks, the logical properties of OSP plans
that achieve valuable states. We show that ε-landmarks correspond
to regular landmarks of certain classical planning tasks that can be
(straightforwardly) derived from the OSP tasks of interest. We then
show how such ε-landmarks can be compiled into the OSP task of
interest, resulting in an equivalent OSP task, but with a stricter cost
satisfaction constraint, and thus with a smaller effective search space.
Finally, we show how such landmark-based task enrichment can be
combined in a mutually stratifying way with the BFBB search used
for OSP planning, resulting in an incremental procedure that
interleaves search and landmark discovery. The entire framework is
independent of the OSP planner specifics, and in particular, of the
heuristic functions it employs. Our empirical evaluation on a large set of
OSP tasks confirms the effectiveness of the proposed approach.

2  PRELIMINARIES 

Since both OSP and classical planning tasks are discussed in the
paper, we use a formalism that is based on the standard STRIPS
formalism with non-negative operator costs (cf. [17]), extended to OSP in
line with the notation of our earlier paper on OSP [25].

Planning Tasks. A planning task structure is given by a pair
$\langle V, O \rangle$, where $V$ is a finite set of propositional state
variables, and $O$ is a finite set of operators. State variables are also
called propositions or facts. A state $s \in 2^V$ is a subset of facts,
representing the propositions which are currently true. Each operator
$o \in O$ is associated with preconditions $\mathrm{pre}(o) \subseteq V$,
add effects $\mathrm{add}(o) \subseteq V$, and delete effects
$\mathrm{del}(o) \subseteq V$. Applying an operator $o$ in $s$ results in
state $(s \setminus \mathrm{del}(o)) \cup \mathrm{add}(o)$, which we denote
as $s[\![o]\!]$. The notation is only defined if $o$ is applicable in $s$,
i.e., if $\mathrm{pre}(o) \subseteq s$. Applying a sequence
$\langle o_1, \ldots, o_k \rangle$ of operators to a state is defined
inductively as $s[\![\varepsilon]\!] := s$ and
$s[\![\langle o_1, \ldots, o_k \rangle]\!] := (s[\![\langle o_1, \ldots, o_{k-1} \rangle]\!])[\![o_k]\!]$.
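The operator semantics above can be sketched in a few lines of Python. This is only an illustration under assumed names: the `Operator` class, the `apply` and `apply_sequence` helpers, and the two toy operators are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

State = FrozenSet[str]


@dataclass(frozen=True)
class Operator:
    # Hypothetical STRIPS-style operator record (names are assumptions).
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float = 1.0


def apply(state: State, op: Operator) -> State:
    """Return (s - del(o)) | add(o); defined only if pre(o) is a subset of s."""
    if not op.pre <= state:
        raise ValueError(f"{op.name} is not applicable")
    return (state - op.delete) | op.add


def apply_sequence(state: State, ops: Iterable[Operator]) -> State:
    """Apply operators left to right; the empty sequence leaves s unchanged."""
    for op in ops:
        state = apply(state, op)
    return state


# Toy example: pick something up at location a, then move to location b.
pick = Operator("pick", frozenset({"at_a"}), frozenset({"holding"}), frozenset())
move = Operator("move", frozenset({"at_a"}), frozenset({"at_b"}), frozenset({"at_a"}))
end_state = apply_sequence(frozenset({"at_a"}), [pick, move])
```

Representing states as frozen sets keeps them hashable, which is convenient for duplicate detection in a forward search.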


A classical planning task $\Pi = \langle V, O; I, G, \mathit{cost} \rangle$
extends its structure $\langle V, O \rangle$ with an initial state
$I \subseteq V$, a goal $G \subseteq V$, and a real-valued, nonnegative
operator cost function $\mathit{cost} : O \to \mathbb{R}^{0+}$.
An operator sequence $\pi$ is called an $s$-plan if it is applicable in $s$,
and $G \subseteq s[\![\pi]\!]$. The cost of $s$-plan $\pi$ is
$\mathit{cost}(\pi) := \sum_{o \in \pi} \mathit{cost}(o)$, and
$\pi$ is optimal if its cost is minimal among all $s$-plans. The objective
in classical planning is to find an $I$-plan of as low cost as possible or
prove that no $I$-plan exists. Optimal classical planning is devoted to
searching for optimal $I$-plans only.

An oversubscription planning (OSP) task
$\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ extends its structure
$\langle V, O \rangle$ with four components: an initial state
$I \subseteq V$ and an operator cost function
$\mathit{cost} : O \to \mathbb{R}^{0+}$ as above, plus a succinctly
represented and efficiently computable state value function
$u : S \to \mathbb{R}^{0+}$, and a cost budget $b \in \mathbb{R}^{0+}$.
In what follows, we assume $u(s) = \sum_{v \in s} u(v)$,
i.e., the value of state $s$ is the sum of (mutually independent) values
of propositions which are true in $s$. Conceptually, our results equally
apply to general value functions, but the complexity of certain
construction steps may vary between different families of value
functions.

In OSP, an operator sequence $\pi$ is called an $s$-plan if it is
applicable in $s$, and $\sum_{o \in \pi} \mathit{cost}(o) \le b$. While even
an empty operator sequence is an $s$-plan for any state $s$, the objective
in OSP is to find an $I$-plan that achieves as valuable a state as possible.
By $u(\pi)$ we refer to the value of the end-state of $\pi$, that is,
$u(\pi) = u(s[\![\pi]\!])$. Optimal OSP is devoted to searching for optimal
$I$-plans only: An $s$-plan $\pi$ is optimal if $u(\pi)$ is maximal among
all the $s$-plans.
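To make the OSP plan definition concrete, the following sketch checks applicability and the budget constraint and then sums the values of the facts in the end-state. The tuple encoding of operators, the function name `plan_value`, and the toy task are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, Optional, Sequence, Tuple

# Illustrative encoding (not from the paper): an operator is (pre, add, delete, cost).
Op = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str], float]


def plan_value(init: FrozenSet[str], plan: Sequence[Op],
               u: Dict[str, float], budget: float) -> Optional[float]:
    """Return u(pi) if pi is an OSP plan for init under the budget, else None."""
    state, spent = init, 0.0
    for pre, add, delete, cost in plan:
        if not pre <= state:
            return None  # the sequence is not applicable
        state = (state - delete) | add
        spent += cost
    if spent > budget:
        return None  # the cost budget is violated
    # Additive value function: sum of the values of the facts true in the end-state.
    return sum(u.get(v, 0.0) for v in state)


op1: Op = (frozenset({"a"}), frozenset({"g1"}), frozenset(), 2.0)
op2: Op = (frozenset({"g1"}), frozenset({"g2"}), frozenset(), 3.0)
values = {"g1": 1.0, "g2": 5.0}
start = frozenset({"a"})

within_budget = plan_value(start, [op1], values, budget=4.0)
over_budget = plan_value(start, [op1, op2], values, budget=4.0)
empty_plan = plan_value(start, [], values, budget=4.0)
```

Note that the empty sequence is always accepted, which is exactly why plain landmarks carry no information for OSP.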

Heuristics. The two major ingredients of any heuristic-search
planner are its search algorithm and heuristic function. In classical
planning, the heuristic is typically a function
$h : 2^V \to \mathbb{R}^{0+} \cup \{\infty\}$,
with $h(s)$ estimating the cost $h^*(s)$ of optimal $s$-plans. A heuristic
$h$ is admissible if it is lower-bounding, i.e., $h(s) \le h^*(s)$ for all
states $s$. All common heuristic search algorithms for optimal classical
planning, such as A*, require admissible heuristics.

In OSP, a heuristic is a function
$h : 2^V \times \mathbb{R}^{0+} \to \mathbb{R}^{0+}$, with
$h(s, b)$ estimating the value $h^*(s, b)$ of optimal $s$-plans under cost
budget $b$. A heuristic $h$ is admissible if it is upper-bounding, i.e.,
$h(s, b) \ge h^*(s, b)$ for all states $s$ and all cost budgets $b$. Here as
well, search algorithms for optimal OSP, such as best-first
branch-and-bound (BFBB) discussed later on in detail, require admissible
heuristics.

Landmarks in Classical Planning. For a state $s$ in a classical
planning task $\Pi$, a landmark is a property of operator sequences that
is satisfied by all $s$-plans [20]. For instance, a fact landmark for a
state $s$ is a fact that is true at some point in every $s$-plan. Several
admissible landmark heuristics have been shown to be extremely
effective in optimal classical planning [21, 17, 5, 28]. These heuristics use
extended notions of landmarks which are subsumed by disjunctive
action landmarks. Each such landmark is a set of operators such that
every $s$-plan contains at least one action from that set. In what
follows we consider this popular notion of landmarks, and simply refer
to disjunctive action landmarks for a state $s$ as $s$-landmarks. For ease
of presentation, most of our discussion will take place in the context
of landmarks for the initial state of the task, and these will simply be
referred to as landmarks (for $\Pi$).

Deciding whether an operator set $L \subseteq O$ is a landmark for
classical planning task $\Pi$ is PSPACE-hard [29]. Therefore, all
landmark heuristics employ methods for landmark discovery that are
polynomial-time, sound, but incomplete. In what follows we
assume access to such a procedure; the actual way the landmarks
are discovered is tangential to our contribution. For a set $\mathcal{L}$
of $s$-landmarks, a landmark cost function
$\mathit{lcost} : \mathcal{L} \to \mathbb{R}^{0+}$ is admissible if
$\sum_{L \in \mathcal{L}} \mathit{lcost}(L) \le h^*(s)$. For a singleton set
$\mathcal{L} = \{L\}$, $\mathit{lcost}(L) := \min_{o \in L} \mathit{cost}(o)$
is a natural admissible landmark cost function, and it extends directly to
non-singleton sets of pairwise disjoint landmarks. For more general sets of
landmarks, $\mathit{lcost}$ can be devised (in polynomial time) via operator
cost partitioning [23], either given $\mathcal{L}$ [21], or within the
actual process of generating $\mathcal{L}$ [17].
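For pairwise disjoint landmarks, the natural admissible cost function assigns each landmark the cost of its cheapest operator. The sketch below illustrates only this simple case; the function name `disjoint_lcost` and the toy costs are hypothetical, and cost partitioning for overlapping landmarks is deliberately not covered.

```python
from typing import Dict, FrozenSet, List


def disjoint_lcost(landmarks: List[FrozenSet[str]],
                   cost: Dict[str, float]) -> Dict[FrozenSet[str], float]:
    """lcost(L) = min over o in L of cost(o), for pairwise disjoint landmarks."""
    seen = set()
    for lm in landmarks:
        if seen & lm:
            raise ValueError("landmarks must be pairwise disjoint")
        seen |= lm
    return {lm: min(cost[o] for o in lm) for lm in landmarks}


# Toy operator costs and two disjoint disjunctive action landmarks.
cost = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 4.0}
L1, L2 = frozenset({"a", "b"}), frozenset({"c", "d"})
lcost = disjoint_lcost([L1, L2], cost)
total = sum(lcost.values())
```

Since every plan must use at least one operator from each of the disjoint sets, the sum of these minima never exceeds the cost of any plan, which is what admissibility requires.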

3  “BRING  ME  SOMETHING”  LANDMARKS 

While landmarks play an important role in (both satisficing and
optimal) classical planning, so far they have not been exploited in OSP.
At first glance, this is probably no surprise: Since landmarks must
hold in all plans, and the empty operator sequence is always a plan
for any OSP task, the notion of landmark does not seem useful here.

Having said that, consider the anytime “output improvement”
property of the forward-search branch-and-bound algorithms used
for heuristic-search OSP. The empty plan is not interesting there not
only because it is useless, but also because it is “found” by the search
algorithm right at the get-go. In general, at all stages of the search,
anytime algorithms like BFBB maintain the best-so-far solution $\pi$,
and prune all branches that promise value lower or equal to $u(\pi)$.
Hence, in principle, such algorithms may benefit from information
about properties that are “satisfied by all plans with value larger than
$x$.” Unfortunately, it is not yet clear how the machinery for
discovering classical planning landmarks can be adapted to discovery of such
“value landmarks” while preserving polynomial-time complexity on
general OSPs and arbitrary lower bounds $x$.

Looking at what is needed and what is available, our goal here is
to exploit this machinery as it is. While the value of different $s$-plans
in an OSP task $\Pi$ varies between zero and the value of the optimal
$s$-plan (which may also be zero), let an ε-landmark for state $s$ be
any property that is satisfied by any $s$-plan $\pi$ that achieves something
valuable. For instance, with the disjunctive action landmarks we use
here, if $L \subseteq O$ is an ε-landmark for $s$, then every $s$-plan $\pi$
with $u(\pi) > 0$ contains an operator from $L$. In what follows, unless
stated otherwise, we focus on ε-landmarks for (the initial state of) $\Pi$.

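Given an OSP task $\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$, let
a classical planning task
$\Pi_\varepsilon = \langle V_\varepsilon, O_\varepsilon; I_\varepsilon, G_\varepsilon, \mathit{cost}_\varepsilon \rangle$
be constructed as $V_\varepsilon = V \cup \{g\}$, $I_\varepsilon = I$,
$G_\varepsilon = \{g\}$, and $O_\varepsilon = O \cup O_g$, where, for
each proposition $v$ with $u(v) > 0$, $O_g$ contains an operator $o_v$ with
$\mathrm{pre}(o_v) = \{v\}$, $\mathrm{add}(o_v) = \{g\}$,
$\mathrm{del}(o_v) = \emptyset$, and $\mathit{cost}_\varepsilon(o_v) = 0$.
For all the original operators $o \in O$,
$\mathit{cost}_\varepsilon(o) = \mathit{cost}(o)$. In other
words, $\Pi_\varepsilon$ extends the structure of $\Pi$ with a set of
zero-cost actions such that applying any of them indicates achieving a
positive value in $\Pi$. In what follows, we refer to $\Pi_\varepsilon$ as
the ε-compilation of $\Pi$.
Constructing $\Pi_\varepsilon$ from $\Pi$ is trivially polynomial time, and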

Theorem 1 For any OSP task $\Pi$, any landmark $L$ for $\Pi_\varepsilon$
such that $L \subseteq O$ is an ε-landmark for $\Pi$.
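The ε-compilation can be sketched as follows: one fresh goal fact plus one zero-cost "collector" operator per positively valued proposition. The `Operator` record, the function name `epsilon_compilation`, the default goal name `"g"`, and the toy task are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Operator:
    # Hypothetical STRIPS-style operator record.
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float


def epsilon_compilation(variables: Set[str], operators: List[Operator],
                        init: Set[str], u: Dict[str, float],
                        goal_fact: str = "g"
                        ) -> Tuple[FrozenSet[str], List[Operator],
                                   FrozenSet[str], FrozenSet[str]]:
    """Return (V_eps, O_eps, I_eps, G_eps) of the classical task built from an OSP task."""
    if goal_fact in variables:
        raise ValueError("goal_fact must be a fresh proposition")
    # One zero-cost operator o_v per fact v with u(v) > 0: pre = {v}, add = {g}.
    collectors = [
        Operator(f"o_{v}", frozenset({v}), frozenset({goal_fact}), frozenset(), 0.0)
        for v in sorted(variables) if u.get(v, 0.0) > 0
    ]
    return (frozenset(variables) | {goal_fact},
            list(operators) + collectors,
            frozenset(init),
            frozenset({goal_fact}))


# Toy OSP task with two valuable facts; the budget plays no role in this step.
ops = [Operator("a", frozenset({"p"}), frozenset({"g1"}), frozenset(), 2.0)]
V_eps, O_eps, I_eps, G_eps = epsilon_compilation(
    {"p", "g1", "g2"}, ops, {"p"}, {"g1": 1.0, "g2": 3.0})
```

Any classical landmark extractor can then be run on the returned task; by Theorem 1, the landmarks it reports that consist only of original operators are ε-landmarks of the OSP task.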

With Theorem 1 in hand,2 we can now derive ε-landmarks for $\Pi$
using any method for classical planning landmark extraction, such as
that employed by the LAMA planner [30] or the LM-Cut family of
techniques [17, 5]. However, at first glance, the discriminative power
of knowing “what is needed to achieve something valuable” seems to
be negligible when it comes to deriving effective heuristic estimates
for OSP. The good news is that, in OSP, such information can be
effectively exploited in a slightly different way.

2 Due to space limitations, all proofs are delegated to a full technical
report [26].


BFBB (Π = ⟨V, O; I, cost, u, b⟩)
  open := new max-heap ordered by f(n) = h(s[n], b − g(n))
  open.insert(make-root-node(I))
  closed := ∅; best-cost := ∅
  initialize best solution n* := I
  while not open.empty()
    n := open.pop-max()
    if f(n) ≤ u(s[n*]): break
    if s[n] ∉ closed or g(n) < best-cost(s[n]):
      closed := closed ∪ {s[n]}
      best-cost(s[n]) := g(n)
      foreach o ∈ O(s[n]):
        n' := make-node(s[n]⟦o⟧)
        if g(n') > b or f(n') ≤ u(s[n*]): continue
        if u(s[n']) > u(s[n*]): update n* := n'
        open.insert(n')
  return n*

Figure  1.  Best-first  branch-and-bound  (BFBB)  search  for  OSP 
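A compact Python rendering of the BFBB loop in Figure 1 may help in reading the pseudo-code. It is a sketch under several assumptions: operators are `(name, pre, add, delete, cost)` tuples, the admissible heuristic is a hypothetical blind upper bound (the sum of all positive fact values), and the toy task is invented for illustration.

```python
import heapq
from itertools import count


def bfbb(init, operators, u, budget, h):
    """Best-first branch-and-bound for OSP; h(state, remaining) must upper-bound value."""
    def value(s):
        return sum(u.get(v, 0.0) for v in s)

    tie = count()  # tie-breaker so heap entries never compare states or plans
    best_state, best_plan = init, []
    # heapq is a min-heap, so f is negated to pop the maximum first.
    open_list = [(-h(init, budget), next(tie), 0.0, init, [])]
    best_cost = {}  # cheapest known cost per expanded state (acts as the closed list)
    while open_list:
        neg_f, _, g, s, plan = heapq.heappop(open_list)
        if -neg_f <= value(best_state):
            break  # nothing left in OPEN can beat the incumbent
        if s in best_cost and g >= best_cost[s]:
            continue  # duplicate reached at no lower cost
        best_cost[s] = g
        for name, pre, add, delete, cost in operators:
            if not pre <= s:
                continue
            s2, g2 = (s - delete) | add, g + cost
            if g2 > budget:
                continue  # violates the cost budget
            f2 = h(s2, budget - g2)
            if f2 <= value(best_state):
                continue  # cannot improve on the incumbent
            if value(s2) > value(best_state):
                best_state, best_plan = s2, plan + [name]
            heapq.heappush(open_list, (-f2, next(tie), g2, s2, plan + [name]))
    return best_plan, value(best_state)


U = {"g1": 1.0, "g2": 2.0, "g3": 4.0}
OPS = [
    ("o1", frozenset({"a"}), frozenset({"g1"}), frozenset(), 2.0),
    ("o2", frozenset({"a"}), frozenset({"g2"}), frozenset({"a"}), 3.0),
    ("o3", frozenset({"g1"}), frozenset({"g3"}), frozenset(), 2.0),
]


def blind(state, remaining):
    # Blind admissible bound: no plan can be worth more than all values together.
    return sum(v for v in U.values() if v > 0)


plan, value = bfbb(frozenset({"a"}), OPS, U, 4.0, blind)
```

Storing the operator names along each node is wasteful but keeps the sketch short; a real implementation would keep parent pointers instead.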

3.1 ε-Landmarks and Budget Reduction

In the same way that A* constitutes a canonical heuristic-search
algorithm for optimal classical planning, anytime best-first
branch-and-bound (BFBB) probably constitutes such an algorithm for
optimal OSP.3 Figure 1 depicts a pseudo-code description of BFBB; $s[n]$
there denotes the state associated with search node $n$. In BFBB for
OSP, a node $n$ with maximum evaluation function $h(s[n], b - g(n))$
is selected from the OPEN list. The duplicate detection and
reopening mechanisms in BFBB are similar to those in A*. In addition,
BFBB maintains the best solution $n^*$ found so far and uses it to
prune all generated nodes evaluated no higher than $u(s[n^*])$.
Likewise, complying with the semantics of OSP, all generated nodes $n$
with cost-so-far $g(n)$ higher than the problem’s budget $b$ are also
immediately pruned. When the OPEN list becomes empty or the node
$n$ selected from the list promises less than the lower bound, BFBB
returns (the plan associated with) the best solution $n^*$, and if $h$ is
admissible, i.e., the $h$-based pruning of the generated nodes is sound,
then the returned plan is guaranteed to be optimal.

Now, consider a schematic example of searching for an optimal
plan for an OSP task $\Pi$ with budget $b$, using BFBB with an
admissible heuristic $h$. Suppose that there is only one sequence of (all
unit-cost) operators, $\pi = \langle o_1, o_2, \ldots, o_{b+1} \rangle$,
applicable in the initial state of $\Pi$, and that the only positive value
state along $\pi$ is its end-state. While clearly no value higher than zero
can be achieved in $\Pi$ under the given budget of $b$, the search will
continue beyond the initial state, unless $h(I, \cdot)$ counts the cost of
all the $b+1$ actions of $\pi$. Now, suppose that $h(I, \cdot)$ counts only
the cost of $\{o_i, \ldots, o_{b+1}\}$ for some $i > 0$, but
$\{o_1\}, \{o_2\}, \ldots, \{o_{i-1}\}$ are all discovered to
be ε-landmarks for $\Pi$. Given that, suppose that we modify $\Pi$ by (a)
setting the cost of operators $o_1, o_2, \ldots, o_{i-1}$ to zero, and (b)
reducing the budget to $b - i + 1$. This modification seems to preserve the
semantics of $\Pi$, while on the modified task, BFBB with the same
heuristic $h$ will prune the initial state and thus establish without any
search that the empty plan is an optimal plan for $\Pi$. Of course, the
way $\Pi$ is modified in this example is as simplistic as the example
itself. Yet, this example does motivate the idea of landmark-based
budget reduction for OSP, as well as illustrates the basic idea behind
the generically sound task modifications that we discuss next.
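As a worked instance of this schematic example, take the illustrative values $b = 3$ and $i = 3$ (these numbers are chosen here and do not appear in the paper). The only applicable sequence has four unit-cost operators, the heuristic accounts for the last two, and the first two are singleton ε-landmarks:

```latex
\begin{align*}
\pi &= \langle o_1, o_2, o_3, o_4 \rangle, \qquad \mathit{cost}(\pi) = 4 > b = 3
  && \text{(no positive value is reachable)} \\
\mathit{cost}(\{o_3, o_4\}) &= 2 \le b = 3
  && \text{(the initial state is not pruned in } \Pi\text{)} \\
b' &= b - i + 1 = 3 - 3 + 1 = 1
  && \text{(budget after zeroing } o_1, o_2\text{)} \\
\mathit{cost}(\{o_3, o_4\}) &= 2 > b' = 1
  && \text{(the initial state is pruned in the modified task)}
\end{align*}
```

The heuristic itself is unchanged between the second and the fourth line; only the budget it is compared against has shrunk.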

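Let $\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ be an OSP task,
$\mathcal{L} = \{L_1, \ldots, L_n\}$ be a set of pairwise disjoint
ε-landmarks for $\Pi$, and $\mathit{lcost}$ be an admissible landmark cost
function from $\mathcal{L}$. Given that, a new OSP task
$\Pi_{\mathcal{L}} = \langle V_{\mathcal{L}}, O_{\mathcal{L}}; I_{\mathcal{L}}, \mathit{cost}_{\mathcal{L}}, u_{\mathcal{L}}, b_{\mathcal{L}} \rangle$
with budget $b_{\mathcal{L}} =$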

3  BFBB  is  also  extensively  used  for  net-benefit  planning  [3,  7,  8],  as  well  as 
some  other  variants  of  deterministic  planning  [4,  6]. 


compile-and-BFBB (Π = ⟨V, O; I, cost, u, b⟩)
  Π_ε := ε-compilation of Π
  𝓛 := a set of landmarks for Π_ε
  lcost := admissible landmark cost function for 𝓛
  Π_𝓛* := budget reducing compilation of (𝓛, lcost) into Π
  n* := BFBB(Π_𝓛*)
  return plan for Π associated with n*

Figure  2.  BFBB  search  with  landmark-based  budget  reduction 

$b - \sum_{i=1}^{n} \mathit{lcost}(L_i)$ is constructed as follows. The set
of variables $V_{\mathcal{L}} = V \cup \{v_{L_1}, \ldots, v_{L_n}\}$ extends
$V$ with a new proposition per ε-landmark in $\mathcal{L}$. These new
propositions are all initially true, and
$I_{\mathcal{L}} = I \cup \{v_{L_1}, \ldots, v_{L_n}\}$. The value function
$u_{\mathcal{L}} = u$ remains unchanged: the new propositions do not affect
the value of the states. Finally, the operator set is extended as
$O_{\mathcal{L}} = O \cup \bigcup_{i=1}^{n} O_{L_i}$,
with $O_{L_i}$ containing an operator $\bar{o}$ for each $o \in L_i$, with
$\mathrm{pre}(\bar{o}) = \mathrm{pre}(o) \cup \{v_{L_i}\}$,
$\mathrm{add}(\bar{o}) = \mathrm{add}(o)$,
$\mathrm{del}(\bar{o}) = \mathrm{del}(o) \cup \{v_{L_i}\}$, and,
importantly,
$\mathit{cost}_{\mathcal{L}}(\bar{o}) = \mathit{cost}(o) - \mathit{lcost}(L_i)$.
In other words, $\Pi_{\mathcal{L}}$ extends the structure of $\Pi$ by
mirroring the operators of each ε-landmark $L_i$ with their
“$\mathit{lcost}(L_i)$ cheaper” versions, while ensuring that these cheaper
operators can be applied no more than once along an operator sequence from
the initial state. At the same time, introduction of these discounted
operators for $L_i$ is compensated for by reducing the budget by precisely
$\mathit{lcost}(L_i)$, leading to effective equivalence between $\Pi$ and
$\Pi_{\mathcal{L}}$.

Theorem 2 Let $\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ be an
OSP task, $\mathcal{L}$ be a set of pairwise disjoint ε-landmarks for $\Pi$,
$\mathit{lcost}$ be an admissible landmark cost function from $\mathcal{L}$,
and $\Pi_{\mathcal{L}}$ be the respective budget reducing compilation of
$\Pi$. For every plan $\pi$ for $\Pi$ with $u(\pi) > 0$, there is
a plan $\pi_{\mathcal{L}}$ for $\Pi_{\mathcal{L}}$ with
$u(\pi_{\mathcal{L}}) = u(\pi)$, and vice versa.
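The budget reducing compilation for pairwise disjoint ε-landmarks can be sketched as below. The `Operator` record, the function name, the `_bar` suffix for discounted copies, and the marker naming `v_<landmark>` are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Operator:
    # Hypothetical STRIPS-style operator record.
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float


def budget_reducing_compilation(
        operators: List[Operator], init: Set[str], budget: float,
        landmarks: Dict[str, Set[str]], lcost: Dict[str, float]
) -> Tuple[List[Operator], FrozenSet[str], float]:
    """Disjoint case: landmarks maps a landmark name to the names of its operators."""
    new_ops = list(operators)
    new_init = set(init)
    for lname, members in landmarks.items():
        marker = f"v_{lname}"      # fresh proposition, initially true
        new_init.add(marker)
        for op in operators:
            if op.name in members:
                # Discounted copy: needs the marker and consumes it, so the
                # discount lcost(L) can be taken at most once.
                new_ops.append(Operator(
                    op.name + "_bar",
                    op.pre | {marker}, op.add, op.delete | {marker},
                    op.cost - lcost[lname]))
    return new_ops, frozenset(new_init), budget - sum(lcost.values())


ops = [
    Operator("a", frozenset({"p"}), frozenset({"q"}), frozenset(), 2.0),
    Operator("b", frozenset({"q"}), frozenset({"g"}), frozenset(), 3.0),
]
new_ops, new_init, new_budget = budget_reducing_compilation(
    ops, {"p"}, 5.0, landmarks={"L1": {"a"}}, lcost={"L1": 2.0})
a_bar = next(o for o in new_ops if o.name == "a_bar")
```

In the toy task the original plan `a, b` costs 5 under budget 5, while its counterpart `a_bar, b` costs 3 under the reduced budget 3, which mirrors the equivalence stated in Theorem 2.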

The above budget reducing compilation of $\Pi$ to $\Pi_{\mathcal{L}}$ is
clearly polynomial time. Putting things together, we can see that the
compile-and-BFBB procedure depicted in Figure 2 (1) generates
an ε-compilation $\Pi_\varepsilon$ of $\Pi$, (2) uses off-the-shelf tools for
classical planning to generate a set of landmarks $\mathcal{L}$ for
$\Pi_\varepsilon$, and an admissible landmark cost function $\mathit{lcost}$,
and (3) compiles $(\mathcal{L}, \mathit{lcost})$ into $\Pi$,
obtaining an OSP task $\Pi_{\mathcal{L}}$. The optimal solution for
$\Pi_{\mathcal{L}}$ (and thus for $\Pi$) is then searched for using a search
algorithm for optimal OSP such as BFBB.

Before we proceed to consider more general sets of landmarks, a
few comments concerning the setup of Theorem 2 are in order.
First, if the reduced budget $b_{\mathcal{L}}$ turns out to be lower than
the cost of the cheapest action applicable in the initial state, then no
search is needed, and the empty plan can be reported as optimal right
away. Second, zero-cost landmarks are useless in our compilation as
much as they are useless in deriving landmark heuristics for optimal
planning. Hence, $\mathit{lcost}$ in what follows is assumed to be strictly
positive. Third, having both $o$ and $\bar{o}$ applicable at a state of
$\Pi_{\mathcal{L}}$ brings no benefits yet adds branching to the search.
Hence, in our implementation, for each landmark $L_i \in \mathcal{L}$ and
each operator $o \in L_i$, the precondition of $o$ in $O_{\mathcal{L}}$ is
extended with $\{\neg v_{L_i}\}$. It is not hard to
verify that this extension4 preserves the correctness of
$\Pi_{\mathcal{L}}$ in terms of Theorem 2. Finally, if the value of the
initial state is not zero, that is, the empty plan has some positive value,
then ε-compilation $\Pi_\varepsilon$ of $\Pi$ will have no positive cost
landmarks at all. However, this can easily be fixed by considering as
“valuable” only facts $v$ such that both $u(v) > 0$ and $v \notin I$. For
now we put this difficulty aside and assume that $u(\varepsilon) = 0$.
Later, however, we come back to consider it more systematically.

4  This  modification  requires  augmenting  our  STRIPS-like  formalism  with  neg¬ 
ative  preconditions,  but  this  augmentation  is  straightforward. 


3.2 Non-Disjoint ε-Landmarks

While the compilation $\Pi_{\mathcal{L}}$ above is sound for pairwise
disjoint landmarks, this is not so for more general sets of ε-landmarks.
For example, consider a planning task $\Pi$ in which, for some operator $o$,
we have $\mathit{cost}(o) = b$, $u(\langle o \rangle) > 0$, and
$u(\pi) = 0$ for all other operator sequences $\pi \ne \langle o \rangle$.
That is, a value greater than zero is achievable in $\Pi$, but only via the
operator $o$. Suppose now that our set of ε-landmarks for $\Pi$ is
$\mathcal{L} = \{L_1, \ldots, L_n\}$, $n > 1$, and that all
of these ε-landmarks contain $o$. In this case, while the budget in
$\Pi_{\mathcal{L}}$ is
$b_{\mathcal{L}} = b - \sum_{i=1}^{n} \mathit{lcost}(L_i)$, the cost of the
cheapest replica $\bar{o}$ of $o$, that is, the cost of the cheapest
operator sequence achieving a non-zero value in $\Pi_{\mathcal{L}}$, is
$\mathit{cost}(o) - \max_{i=1}^{n} \mathit{lcost}(L_i) > b_{\mathcal{L}}$.
Hence, no state with positive value will be reachable from
$I_{\mathcal{L}}$ in $\Pi_{\mathcal{L}}$, and thus $\Pi$ and
$\Pi_{\mathcal{L}}$ are not “value equivalent” in the sense of Theorem 2.

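Since non-disjoint landmarks can bring more information, and
they are typical to outputs of standard techniques for landmark
extraction in classical planning, we now present a different, slightly
more involved, compilation that is both polynomial and sound for
arbitrary sets of ε-landmarks. Let
$\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ be an OSP
task, $\mathcal{L} = \{L_1, \ldots, L_n\}$ be a set of ε-landmarks for
$\Pi$, and $\mathit{lcost}$ be an admissible landmark cost function from
$\mathcal{L}$. For each operator $o$, let $\mathcal{L}(o)$ denote the set of
all landmarks in $\mathcal{L}$ that contain $o$. Given that, a new OSP task
$\Pi_{\mathcal{L}^*} = \langle V_{\mathcal{L}^*}, O_{\mathcal{L}^*}; I_{\mathcal{L}^*}, \mathit{cost}_{\mathcal{L}^*}, u_{\mathcal{L}^*}, b_{\mathcal{L}^*} \rangle$
is constructed as follows. Similarly to $\Pi_{\mathcal{L}}$, we have
$b_{\mathcal{L}^*} = b - \sum_{i=1}^{n} \mathit{lcost}(L_i)$,
$V_{\mathcal{L}^*} = V \cup \{v_{L_1}, \ldots, v_{L_n}\}$,
$I_{\mathcal{L}^*} = I \cup \{v_{L_1}, \ldots, v_{L_n}\}$, and
$u_{\mathcal{L}^*} = u$. The operator set $O_{\mathcal{L}^*}$ extends $O$
with two sets of operators:

• For each operator $o \in O$ that participates in some landmark from
  $\mathcal{L}$, $O_{\mathcal{L}^*}$ contains an action $\bar{o}$ with
  $\mathrm{pre}(\bar{o}) = \mathrm{pre}(o) \cup \{v_L \mid L \in \mathcal{L}(o)\}$,
  $\mathrm{add}(\bar{o}) = \mathrm{add}(o)$,
  $\mathrm{del}(\bar{o}) = \mathrm{del}(o) \cup \{v_L \mid L \in \mathcal{L}(o)\}$,
  $\mathit{cost}_{\mathcal{L}^*}(\bar{o}) = \mathit{cost}(o) - \sum_{L \in \mathcal{L}(o)} \mathit{lcost}(L)$.

• For each $L \in \mathcal{L}$, $O_{\mathcal{L}^*}$ contains an action
  $\mathit{get}(L)$ with
  $\mathrm{pre}(\mathit{get}(L)) = \{\neg v_L\}$,
  $\mathrm{add}(\mathit{get}(L)) = \{v_L\}$,
  $\mathrm{del}(\mathit{get}(L)) = \emptyset$,
  $\mathit{cost}_{\mathcal{L}^*}(\mathit{get}(L)) = \mathit{lcost}(L)$.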

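For example, let $\mathcal{L} = \{L_1, L_2, L_3\}$, $L_1 = \{a, b\}$,
$L_2 = \{b, c\}$, $L_3 = \{a, c\}$, with all operators having the cost of 2,
and let $\mathit{lcost}(L_1) = \mathit{lcost}(L_2) = \mathit{lcost}(L_3) = 1$.
In $\Pi_{\mathcal{L}^*}$, we have
$V_{\mathcal{L}^*} = V \cup \{v_{L_1}, v_{L_2}, v_{L_3}\}$ and
$O_{\mathcal{L}^*} = O \cup \{\bar{a}, \bar{b}, \bar{c}, \mathit{get}(L_1), \mathit{get}(L_2), \mathit{get}(L_3)\}$,
with, e.g.,
$\mathrm{pre}(\bar{a}) = \mathrm{pre}(a) \cup \{v_{L_1}, v_{L_3}\}$,
$\mathrm{add}(\bar{a}) = \mathrm{add}(a)$,
$\mathrm{del}(\bar{a}) = \mathrm{del}(a) \cup \{v_{L_1}, v_{L_3}\}$, and
$\mathit{cost}_{\mathcal{L}^*}(\bar{a}) = 0$, and, for $\mathit{get}(L_1)$,
$\mathrm{pre}(\mathit{get}(L_1)) = \{\neg v_{L_1}\}$,
$\mathrm{del}(\mathit{get}(L_1)) = \emptyset$,
$\mathrm{add}(\mathit{get}(L_1)) = \{v_{L_1}\}$, and
$\mathit{cost}_{\mathcal{L}^*}(\mathit{get}(L_1)) = 1$.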
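The compilation for arbitrary (possibly overlapping) ε-landmarks can be sketched in Python and checked against the three-landmark example above. The `Operator` record with an extra `neg_pre` field for negative preconditions, the `_bar` and `get_` naming, and the starting budget of 6 are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Operator:
    # Hypothetical operator record; neg_pre lists facts that must be false.
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float
    neg_pre: FrozenSet[str] = frozenset()


def compile_non_disjoint(
        operators: List[Operator], init: Set[str], budget: float,
        landmarks: Dict[str, Set[str]], lcost: Dict[str, float]
) -> Tuple[List[Operator], FrozenSet[str], float]:
    marker = {name: f"v_{name}" for name in landmarks}
    new_ops = list(operators)
    for op in operators:
        containing = [n for n, members in landmarks.items() if op.name in members]
        if not containing:
            continue
        markers = frozenset(marker[n] for n in containing)
        # One discounted copy per operator: it consumes the markers of ALL
        # landmarks containing it and is cheaper by the sum of their lcost.
        discount = sum(lcost[n] for n in containing)
        new_ops.append(Operator(op.name + "_bar", op.pre | markers, op.add,
                                op.delete | markers, op.cost - discount))
    for name in landmarks:
        # get(L) buys a consumed marker back at the price lcost(L).
        new_ops.append(Operator(f"get_{name}", frozenset(),
                                frozenset({marker[name]}), frozenset(),
                                lcost[name],
                                neg_pre=frozenset({marker[name]})))
    new_init = frozenset(init) | frozenset(marker.values())
    return new_ops, new_init, budget - sum(lcost[n] for n in landmarks)


ops = [Operator(n, frozenset(), frozenset(), frozenset(), 2.0) for n in "abc"]
lms = {"L1": {"a", "b"}, "L2": {"b", "c"}, "L3": {"a", "c"}}
new_ops, new_init, new_budget = compile_non_disjoint(
    ops, set(), 6.0, lms, {"L1": 1.0, "L2": 1.0, "L3": 1.0})
by_name = {o.name: o for o in new_ops}
```

Pairing the discounted copies with the `get` actions is what keeps overlapping landmarks sound: an operator that belongs to several landmarks takes all of their discounts at once, and a discount can be taken again only after it has been paid back.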

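Theorem 3 Let $\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ be an
OSP task and $\Pi_{\mathcal{L}^*}$ a budget reducing compilation of $\Pi$.
For every plan $\pi$ for $\Pi$ with $u(\pi) > 0$, there is a plan
$\pi_{\mathcal{L}^*}$ for $\Pi_{\mathcal{L}^*}$ with
$u(\pi_{\mathcal{L}^*}) = u(\pi)$, and vice versa.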

4 ε-LANDMARKS & INCREMENTAL BFBB

As we discussed earlier, if the value of the initial state is not zero, i.e.,
the empty plan has some positive value, then the basic ε-compilation
$\Pi_\varepsilon$ of $\Pi$ will have no positive cost landmarks at all. In
passing we noted that this small problem can be remedied by considering as
“valuable” only facts $v$ such that both $u(v) > 0$ and $v \notin I$. We now
consider this aspect of OSP more closely, and show how discovery of
ε-landmarks and incremental revelation of plans by BFBB can be
combined in a mutually stratifying way.

Let $\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ be the OSP task of
our interest, and suppose we are given a set of plans
$\pi_1, \ldots, \pi_n$ for $\Pi$. If so,
then we are no longer interested in searching for plans that “achieve
something,” but in searching for plans that achieve something beyond


inc-compile-and-BFBB (Π = ⟨V, O; I, cost, u, b⟩)
  initialize global variables:
    n* := I          // best solution so far
    S_ref := {I}     // current reference states
  loop:
    Π_(ε,S_ref) := (ε, S_ref)-compilation of Π
    𝓛 := a set of landmarks for Π_(ε,S_ref)
    lcost := admissible landmark cost function from 𝓛
    Π_𝓛* := budget reducing compilation of (𝓛, lcost) into Π
    if inc-BFBB(Π_𝓛*, S_ref, n*) = done:
      return plan for Π associated with n*

inc-BFBB (Π, S_ref, n*)
  open := new max-heap ordered by f(n) = h(s[n], b − g(n))
  open.insert(make-root-node(I))
  closed := ∅; best-cost := ∅
  while not open.empty()
    n := open.pop-max()
    if goods(s[n]) ⊈ goods(s') for all s' ∈ S_ref:
      S_ref := S_ref ∪ {s[n]}
      if termination criterion: return updated
    if f(n) ≤ u(s[n*]): break
    // ...
    // similar to BFBB in Figure 1
  return done

Figure  3.  Iterative  BFBB  with  landmark  enhancement 

what $\pi_1, \ldots, \pi_n$ already achieve. For $1 \le i \le n$, let
$s_i = I[\![\pi_i]\!]$ be the end-state of $\pi_i$, and for any set of
propositions $s \subseteq V$, let $\mathit{goods}(s) \subseteq s$ be the set
of all facts $v \in s$ such that $u(v) > 0$.
If a new plan $\pi$ with end-state $s$ achieves something beyond what
$\pi_1, \ldots, \pi_n$ already achieve, then
$\mathit{goods}(s) \setminus \mathit{goods}(s_i) \ne \emptyset$ for all
$1 \le i \le n$.
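The test "achieves something beyond every reference state" is a simple set computation. The helper names `goods` and `achieves_something_new` and the toy values are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Iterable


def goods(state: Iterable[str], u: Dict[str, float]) -> FrozenSet[str]:
    """Facts of the state that carry positive value."""
    return frozenset(v for v in state if u.get(v, 0.0) > 0)


def achieves_something_new(state: Iterable[str],
                           reference_states: Iterable[Iterable[str]],
                           u: Dict[str, float]) -> bool:
    """True iff goods(state) - goods(s_i) is non-empty for every reference state s_i."""
    mine = goods(state, u)
    return all(mine - goods(s_i, u) for s_i in reference_states)


u = {"g1": 1.0, "g2": 2.0}
refs = [frozenset({"a", "g1"})]
same_as_before = achieves_something_new({"g1"}, refs, u)
different_good = achieves_something_new({"a", "g2"}, refs, u)
```

With an empty collection of reference states the check holds trivially, so every state counts as new.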

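We now put this observation to work. Given an OSP task
$\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ and a set of reference
states $S_{\mathrm{ref}} = \{s_1, \ldots, s_n\}$ of $\Pi$, let a classical
planning task
$\Pi_{(\varepsilon, S_{\mathrm{ref}})} = \langle V_\varepsilon, O_\varepsilon; I_\varepsilon, G_\varepsilon, \mathit{cost}_\varepsilon \rangle$
be constructed as follows. The variable set
$V_\varepsilon = V \cup \{x_1, \ldots, x_n, \mathit{search}, \mathit{collect}\}$
extends $V$ with a new proposition per state in $S_{\mathrm{ref}}$, plus two
auxiliary control variables. In the initial state, all the new variables but
$\mathit{search}$ are false, i.e.,
$I_\varepsilon = I \cup \{\mathit{search}\}$, and the goal is
$G_\varepsilon = \{x_1, \ldots, x_n\}$. The operator
set $O_\varepsilon$ contains three sets of operators: First, each operator
$o \in O$ is represented in $O_\varepsilon$ by an operator $\bar{o}$, with
the only difference between $o$ and $\bar{o}$ (including cost) being that
$\mathrm{pre}(\bar{o}) = \mathrm{pre}(o) \cup \{\mathit{search}\}$.
We denote this set of new operators $\bar{o}$ by $\bar{O}$. Second, for each
$s_i \in S_{\mathrm{ref}}$ and each value-carrying fact $g$ that is not in
$s_i$, i.e., for each $g \in \mathit{goods}(V) \setminus s_i$,
$O_\varepsilon$ contains a zero-cost action $o_{i,g}$ with
$\mathrm{pre}(o_{i,g}) = \{g, \mathit{collect}\}$,
$\mathrm{add}(o_{i,g}) = \{x_i\}$, $\mathrm{del}(o_{i,g}) = \emptyset$.
Finally, $O_\varepsilon$ contains a zero-cost action $\mathit{finish}$ with
$\mathrm{pre}(\mathit{finish}) = \emptyset$,
$\mathrm{del}(\mathit{finish}) = \{\mathit{search}\}$, and
$\mathrm{add}(\mathit{finish}) = \{\mathit{collect}\}$.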

It is easy to verify that (1) the goal $G_\varepsilon$ cannot be achieved
without applying the $\mathit{finish}$ operator, (2) the operators $\bar{o}$
can be applied only before $\mathit{finish}$, and (3) the subgoal achieving
operators $o_{i,g}$ can be applied only after $\mathit{finish}$. Hence, the
first part of any plan for $\Pi_{(\varepsilon, S_{\mathrm{ref}})}$
determines a plan for $\Pi$, and the second part “verifies” that the
end-state of that plan achieves a subset of value-carrying propositions
$\mathit{goods}(V)$ that is included in no state from $S_{\mathrm{ref}}$.5
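The two-phase structure of the (ε, S_ref)-compilation can be sketched as follows. The `Operator` record, the function name `ref_compilation`, the dictionary layout of the result, and the operator naming scheme are assumptions of this illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Sequence, Set


@dataclass(frozen=True)
class Operator:
    # Hypothetical STRIPS-style operator record.
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float


def ref_compilation(variables: Set[str], operators: List[Operator],
                    init: Set[str], u: Dict[str, float],
                    reference_states: Sequence[Set[str]]) -> dict:
    """Build the classical task used to find landmarks of 'novel value' plans."""
    valuable = sorted(v for v in variables if u.get(v, 0.0) > 0)
    xs = [f"x_{i}" for i in range(1, len(reference_states) + 1)]
    # Phase 1: copies of the original operators, usable only while 'search' holds.
    searching = [Operator(op.name + "_bar", op.pre | {"search"},
                          op.add, op.delete, op.cost) for op in operators]
    # Phase 2: zero-cost o_{i,g} marks that the end-state has a good missing from s_i.
    collectors = [
        Operator(f"o_{i}_{g}", frozenset({g, "collect"}),
                 frozenset({f"x_{i}"}), frozenset(), 0.0)
        for i, s_i in enumerate(reference_states, start=1)
        for g in valuable if g not in s_i
    ]
    # 'finish' switches irrevocably from the search phase to the collect phase.
    finish = Operator("finish", frozenset(), frozenset({"collect"}),
                      frozenset({"search"}), 0.0)
    return {
        "variables": frozenset(variables) | set(xs) | {"search", "collect"},
        "operators": searching + collectors + [finish],
        "init": frozenset(init) | {"search"},
        "goal": frozenset(xs),
    }


ops = [Operator("a", frozenset({"p"}), frozenset({"g1"}), frozenset(), 2.0)]
task = ref_compilation({"p", "g1", "g2"}, ops, {"p"},
                       {"g1": 1.0, "g2": 2.0},
                       reference_states=[{"p"}, {"p", "g1"}])
names = [o.name for o in task["operators"]]
```

For the second reference state, which already contains `g1`, only `g2` yields a collector, so any plan reaching the goal must end in a state holding `g2`.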

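Theorem 4 Let $\Pi = \langle V, O; I, \mathit{cost}, u, b \rangle$ be an
OSP task, $S_{\mathrm{ref}} = \{s_1, \ldots, s_n\} \subseteq 2^V$ be a
subset of $\Pi$’s states, and $L$ be a landmark for
$\Pi_{(\varepsilon, S_{\mathrm{ref}})}$ such that $L \subseteq \bar{O}$. For
any plan $\pi$ for $\Pi$ such that
$\mathit{goods}(I[\![\pi]\!]) \setminus \mathit{goods}(s_i) \ne \emptyset$
for all $s_i \in S_{\mathrm{ref}}$, $\pi$ contains an
instance of at least one operator from $L' = \{o \mid \bar{o} \in L\}$.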

Theorem 4 allows us to define an iterative version of BFBB,
successive iterations of which correspond to running the regular BFBB

5  This  “plan  in  two  parts"  technique  appears  to  be  helpful  in  many  planning 
formalism  compilations;  see,  e.g.,  [24]. 


on successively more informed $(\varepsilon, S_{\mathrm{ref}})$-compilations
of $\Pi$, with the states discovered at iteration $i$ making the
$(\varepsilon, S_{\mathrm{ref}})$-compilation used at iteration $i + 1$ more
informed. The respective procedure
inc-compile-and-BFBB is depicted in Figure 3. This procedure
maintains a set of reference states $S_{\mathrm{ref}}$ and the best solution
so far $n^*$, and loops over calls to inc-BFBB, a modified version of BFBB.
At each iteration of the loop, inc-BFBB is called with an
$(\varepsilon, S_{\mathrm{ref}})$-compilation of $\Pi$, created on the basis
of the current $S_{\mathrm{ref}}$ and $n^*$, and it is provided with access
to both $S_{\mathrm{ref}}$ and $n^*$. The reference set
$S_{\mathrm{ref}}$ is then extended by inc-BFBB with all the non-redundant
value-carrying states discovered during the search, and $n^*$ is updated if
the search discovers nodes of higher value.

If and when the OPEN list becomes empty or the node $n$
selected from the list promises less than the lower bound, inc-BFBB
returns an indicator, done, that the best solution $n^*$ found so far,
across the iterations of inc-compile-and-BFBB, is optimal. In that
case, inc-compile-and-BFBB leaves its loop and extracts that
optimal plan from $n^*$. However, inc-BFBB may also terminate in
a different way, if a certain complementary termination criterion
is satisfied. The latter criterion comes to assess whether the
updates to $S_{\mathrm{ref}}$ performed in the current session of BFBB
warrant updating the $(\varepsilon, S_{\mathrm{ref}})$-compilation and
restarting the search.6 If terminated this way, inc-BFBB returns a
respective indicator, and inc-compile-and-BFBB goes into another iteration
of its loop, with the updated $S_{\mathrm{ref}}$ and $n^*$.

5  EMPIRICAL  EVALUATION 

We have implemented a prototype heuristic-search OSP solver on top
of the Fast Downward planner [16]. The implementation included7:

• $(\varepsilon, S_{\mathrm{ref}})$-compilation of OSP tasks $\Pi$;

• Generation of disjunctive action landmarks for
$(\varepsilon, S_{\mathrm{ref}})$-compilations using the LM-Cut procedure
[17] of Fast Downward;

•  The  incremental  BFBB  procedure  inc-compile-and-BFBB  from 
the  previous  section,  with  the  search  termination  criterion  being 
satisfied  (only)  if  the  examined  node  n  improves  over  current 
value  lower  bound;  and 

•  An  additive  abstraction  heuristic  from  the  framework  of  Mirkis 
and  Domshlak  [25],  incorporating  (i)  an  ad  hoc  action  cost  parti¬ 
tion  over  k  projections  of  the  planning  task  onto  connected  subsets 
of  ancestors  of  the  respective  k  goal  variables  in  the  causal  graph, 
and  (ii)  a  value  partition  that  associates  the  value  of  each  goal 
(only)  with  the  respective  projection.  The  size  of  each  projection 
was  limited  to  1000  abstract  states. 

After some preliminary evaluation, we also added two (optimality
preserving) enhancements to the search. First, the auxiliary
variables of our compilations increase the dimensionality of the
problem, and this is well known to negatively affect the quality of the
projection abstractions. Hence, we devised the projections with
respect to the original OSP problem $\Pi$, and the open list was
ordered as if the search is done on the original problem, that is, by
$h(s[n]^{\downarrow V}, b - g(n) + \sum_{v_L \in s[n]} \mathit{lcost}(L))$,
where $s[n]^{\downarrow V}$ is the
projection of the state $s[n]$ on the variables of the original OSP task
$\Pi$. This change in heuristic evaluation is sound, as Theorem 3 in
particular implies that any admissible heuristic for $\Pi$ is also an
admissible heuristic for $\Pi_{\mathcal{L}^*}$, and vice versa. Second, when
a new node $n$ is generated, we check whether
$g(n) + \sum_{v_L \notin s[n]} \mathit{lcost}(L) \ge$

6  While  the  optimality  of  the  algorithm  holds  for  any  such  termination  condi¬ 
tion,  the  latter  should  greatly  affect  the  runtime  efficiency  of  the  algorithm. 

7  We  are  not  aware  of  any  other  domain-independent  planner  for  optimal  OSP. 


$g(n') + \sum_{v_L \notin s[n']} \mathit{lcost}(L)$, for some previously
generated node $n'$ that corresponds to the same state of the original
problem $\Pi$, i.e., $s[n']^{\downarrow V} = s[n]^{\downarrow V}$. If so,
then $n$ is pruned right away. Optimality
preservation of this enhancement is established in [26].
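The bookkeeping behind both enhancements can be sketched as follows, under one explicit assumption about the reading of the formulas: a landmark proposition that is still true marks a discount not yet taken, and one that is false marks a discount already taken. Under that reading, the cost of a path in the original task is the compiled cost plus the discounts taken, and the budget left in the original task is the compiled budget left plus the discounts still available. All names below are hypothetical.

```python
from typing import Dict, Set


def original_cost(g: float, state: Set[str],
                  lcost_of_marker: Dict[str, float]) -> float:
    """Compiled cost-so-far plus the discounts already taken (markers now false)."""
    return g + sum(lc for m, lc in lcost_of_marker.items() if m not in state)


def remaining_original_budget(compiled_budget: float, g: float, state: Set[str],
                              lcost_of_marker: Dict[str, float]) -> float:
    """Compiled budget left plus the discounts still available (markers still true)."""
    return compiled_budget - g + sum(
        lc for m, lc in lcost_of_marker.items() if m in state)


def dominated(g_new: float, state_new: Set[str],
              g_old: float, state_old: Set[str],
              lcost_of_marker: Dict[str, float]) -> bool:
    """A new node is pruned if an old node with the same original state is no costlier."""
    return (original_cost(g_new, state_new, lcost_of_marker)
            >= original_cost(g_old, state_old, lcost_of_marker))


lcost_of_marker = {"v_L1": 1.0, "v_L2": 2.0}
compiled_budget = 3.0                      # original budget 6 minus total lcost 3
node_state = {"p", "v_L1"}                 # v_L2 has already been consumed
spent = original_cost(1.0, node_state, lcost_of_marker)
left = remaining_original_budget(compiled_budget, 1.0, node_state, lcost_of_marker)
```

The two quantities always add up to the original budget, which is a convenient sanity check when experimenting with such a compilation.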

Since, unlike classical and net-benefit planning, OSP lacks
standard benchmarks for comparative evaluation, we have cast in this
role the STRIPS classical planning domains from the International
Planning Competitions (IPC) 1998-2006. This “translation” to OSP
was done by associating a separate unit-value with each sub-goal.
The evaluation included the regular BFBB planning for $\Pi$, solving $\Pi$
using landmark-based compilation via compile-and-BFBB, and the
straightforward setting of inc-compile-and-BFBB described above.
All three approaches were evaluated under the blind heuristic and the
additive abstraction heuristic as above.

Figure  4  depicts  the  results  of  our  evaluation  in  terms  of  expanded 
nodes  on  all  the  aforementioned  IPC  tasks  for  which  we  could  deter¬ 
mine  offline  the  minimal  cost  needed  to  achieve  all  the  goals  in  the 
task.  Each  task  was  approached  under  four  different  budgets,  corre¬ 
sponding  to  25%,  50%,  75%,  and  100%  of  the  minimal  cost  needed 
to  achieve  all  the  goals  in  the  task,  and  each  run  was  restricted  to  10 
minutes.  Figures  4(a)  and  4(b)  compare  the  performance  of  BFBB 
and  compile-and-BFBB  with  blind  (a)  and  abstraction  (b)  heuristics. 
Figures  4(c)  and  4(d)  provide  a  similar  comparison  between  BFBB 
and inc-compile-and-BFBB.

As  Figure  4  shows,  the  results  are  satisfactory.  With  no  infor¬ 
mative  heuristic  guidance  at  all,  the  number  of  nodes  expanded  by 
compile-and-BFBB  was  typically  much  lower  than  the  number  of 
nodes  expanded  by  BFBB,  with  the  difference  reaching  three  or¬ 
ders  of  magnitude  more  than  once.  Of  the  760  task/budget  pairs  be¬ 
hind  Figure  4a,  81  pairs  were  solved  by  compile-and-BFBB  with  no 
search  at  all  (by  proving  that  no  plan  can  achieve  value  higher  than 
that  of  the  initial  state),  while,  unsurprisingly,  only  4  of  these  tasks 
were  solved  with  no  search  by  BFBB. 

As  expected,  the  value  of  landmark-based  budget  reduction  is 
lower  when  the  search  is  equipped  with  a  meaningful  heuristic  (Fig¬ 
ure  4b).  Yet,  even  with  our  abstraction  heuristic  in  hand,  the  num¬ 
ber  of  nodes  expanded  by  compile-and-BFBB  was  often  substan¬ 
tially  lower  than  the  number  of  nodes  expanded  by  BFBB.  Here, 
BFBB  and  compile-and-BFBB  solved  with  no  search  39  and  85 
task/budget  pairs,  respectively.  Finally,  despite  the  rather  ad  hoc  set¬ 
ting  of  our  incremental  inc-compile-and-BFBB  procedure,  switch¬ 
ing  from  compile-and-BFBB  to  inc-compile-and-BFBB  was  typi¬ 
cally  beneficial,  though  much  deeper  investigation  and  development 
of  inc-compile-and-BFBB  is  obviously  still  required. 
Abstract

In deterministic OSP, the objective is to achieve an as valuable
as possible subset of goals within a fixed allowance of the total
action cost. Although numerous applications in various fields share
this objective, no substantial algorithmic advances have been made
beyond the very special settings of net-benefit optimization.
Tracing the key sources of progress in classical planning, we
identify a severe lack of domain-independent approximations for OSP,
and start with investigating the prospects of abstraction
approximations for this problem. In particular, we define the notion
of additive abstractions for OSP, study the complexity of deriving
effective abstractions from a rich space of hypotheses, and reveal
some substantial, empirically relevant islands of tractability.

Introduction 

In  deterministic  planning,  the  basic  structure  of  acting  with 
underconstrained  or  overconstrained  resources  is  respec¬ 
tively  captured  by  classical  planning  and  oversubscription 
planning.  In  classical  planning,  all  goals  must  be  achieved 
at  as  low  a  total  cost  of  the  actions  as  possible.  In  oversub¬ 
scription  planning  (OSP),  an  as  valuable  as  possible  subset 
of  goals  should  be  achieved  within  a  fixed  allowance  of  the 
total  action  cost.  While  both  theory  and  practice  of  classical 
planning  have  been  rapidly  advancing,  progress  in  OSP  has 
been  mostly  in  the  direction  of  net-benefit  planning.  In  net- 
benefit  planning,  no  explicit  restriction  is  put  on  the  plan 
cost,  and  the  action  costs  and  goal  utilities  are  assumed  to 
be  comparable,  with  the  objective  being  maximizing  the  dif¬ 
ference  between  the  cumulative  value  of  the  achieved  goals 
and  the  cost  invested  in  achieving  them.  Although  there  are 
numerous  interesting  algorithms  for  net-benefit  planning,  it 
was  recently  shown  to  be  polynomial-time  reducible  to  clas¬ 
sical  planning  (Keyder  and  Geffner  2009).  As  such,  it  con¬ 
stitutes  an  extremely  special  variant  of  oversubscription. 

A  closer  look  shows  that  the  recent  progress  in  clas¬ 
sical  planning  stems,  to  a  large  extent,  from  advances  in 
domain-independent  approximations,  or  heuristics,  of  the 
cost  needed  to  achieve  all  the  goals  from  a  given  state.  It  is 
thus  possible  that  having  a  similarly  rich  palette  of  effective 
heuristic  functions  for  OSP  would  advance  the  state-of-the- 
art  in  that  problem.  In  principle,  the  reduction  of  Keyder 
and Geffner (2009) from net-benefit to classical planning
can  be  used  to  reduce  OSP  to  classical  planning  with  nu¬ 
meric  state  variables  (Fox  and  Long  2003;  Helmert  2002). 
So  far,  however,  progress  in  classical  planning  with  numeric 
state  variables  has  mostly  been  achieved  along  delete  re¬ 
laxation  heuristics  (Hoffmann  2003;  Edelkamp  2003),  and 
these  heuristics  do  not  preserve  information  on  consum¬ 
able  resources:  the  “negative”  action  effects  that  decrease 
the  values  of  numeric  variables  are  ignored,  possibly  up  to 
some  special  handling  of  so-called  “cyclic  resource  trans¬ 
fer”  (Coles  et  al.  2008). 

In  this  work  we  make  first  steps  towards  effective  heuris¬ 
tics  for  OSP,  and  in  particular,  towards  admissible  abstrac¬ 
tion  heuristics  for  this  problem.  In  classical  planning, 
state-space  abstractions  are  among  the  most  prominent  tech¬ 
niques  for  devising  admissible  heuristics  (Edelkamp  2002; 
Haslum  et  al.  2007;  Helmert,  Haslum,  and  Hoffmann  2007; 
Katz  and  Domshlak  2010a).  Departing  from  the  most  basic 
question  of  what  state-space  abstractions  for  OSP  actually 
are  (and  what  they  are  not),  we  show  that  the  very  notion 
of  abstraction  substantially  differs  in  classical  and  in  OSP. 
We  define  additive  abstractions  and  abstraction  heuristics  for 
OSP,  and  investigate  computational  complexity  of  deriving 
effective  abstraction  heuristics  in  the  scope  of  homomorphic 
abstraction  skeletons,  paired  with  cost,  value,  and  budget 
partitions.  Along  with  revealing  some  significant  islands 
of  tractability,  we  expose  an  interesting  interplay  between 
knapsack-style  problems,  convex  optimization,  and  princi¬ 
ples  borrowed  from  explicit  abstractions  for  classical  plan¬ 
ning.  We  believe  that  this  interplay  opens  the  road  to  much 
further  research. 

Formalism  and  Background 

In line with the SAS+ formalism for deterministic planning
(Bäckström and Klein 1991; Bäckström and Nebel 1995), a planning
task structure is given by a pair $\langle V, A \rangle$, where $V$
is a set of $n$ finite-domain state variables, and $A$ is a finite
set of actions. Each complete assignment to $V$ is called a state,
and $S = dom(v_1) \times \cdots \times dom(v_n)$ is the state space
of the structure $\langle V, A \rangle$. Each action $a$ is a pair
$\langle pre(a), eff(a) \rangle$ of partial assignments to $V$ called
preconditions and effects, respectively. Denoting by
$V(p) \subseteq V$ the subset of variables instantiated by a partial
assignment $p$, action $a$ is applicable in a state $s$ iff
$s[v] = pre(a)[v]$ for all $v \in V(pre(a))$. Applying $a$ changes
the value of each $v \in V(eff(a))$ to $eff(a)[v]$. The resulting
state is denoted by $s[a]$; by $s[\langle a_1, \ldots, a_k \rangle]$
we denote the state obtained from sequential application of the
(applicable in turn) actions $a_1, \ldots, a_k$ starting at state $s$.
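
The applicability and progression rules above can be illustrated with a minimal Python sketch. The dictionary encoding of states and partial assignments, the names `Action`, `applicable`, `apply`, and `apply_sequence`, and the two-variable toy task are all assumptions made for illustration; they are not part of the formalism or of any planner discussed here.

```python
from typing import Dict, NamedTuple, Sequence

State = Dict[str, int]  # complete assignment: variable name -> value


class Action(NamedTuple):
    name: str
    pre: Dict[str, int]  # partial assignment: preconditions
    eff: Dict[str, int]  # partial assignment: effects


def applicable(s: State, a: Action) -> bool:
    """a is applicable in s iff s[v] == pre(a)[v] for all v in V(pre(a))."""
    return all(s[v] == val for v, val in a.pre.items())


def apply(s: State, a: Action) -> State:
    """Return s[a]: every v in V(eff(a)) is set to eff(a)[v]."""
    if not applicable(s, a):
        raise ValueError(f"{a.name} is not applicable")
    return {**s, **a.eff}


def apply_sequence(s: State, actions: Sequence[Action]) -> State:
    """Return s[<a_1, ..., a_k>], applying the actions in turn."""
    for a in actions:
        s = apply(s, a)
    return s


# Hypothetical two-variable task, not taken from the paper.
move = Action("move", pre={"at": 0}, eff={"at": 1})
load = Action("load", pre={"at": 1, "loaded": 0}, eff={"loaded": 1})
s0 = {"at": 0, "loaded": 0}
final = apply_sequence(s0, [move, load])
```

States are copied rather than mutated, so `s0` remains available after progression.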

In classical planning, a planning task
$\Pi = \langle V, A; s_0, G, c \rangle$ extends its structure with an
initial state $s_0 \in S$, a goal specification $G$, typically
modeled as a partial assignment to $V$, and an action cost function
$c : A \to \mathbb{R}^{0+}$. An action sequence $\rho$ is called an
$s$-plan if it is applicable in $s$, and $G \subseteq s[\rho]$. An
$s$-plan is optimal if the sum of its action costs is minimal among
all $s$-plans. The objective in classical planning is to find an
$s_0$-plan of as low cost as possible, with optimal classical
planning being devoted to searching for optimal $s_0$-plans only.

In contrast, an oversubscription planning (OSP) task
$\Pi = \langle V, A; s_0, c, u, b \rangle$ extends its structure with
four components: an initial state $s_0 \in S$ and an action cost
function $c : A \to \mathbb{R}^{0+}$ as above, plus a succinctly
represented and efficiently computable state value function
$u : S \to \mathbb{R}^{0+}$, and a cost budget $b \in \mathbb{R}^{0+}$.
An action sequence $\rho$ is called an $s$-plan if it is applicable
in $s$, and $\sum_{a \in \rho} c(a) \le b$; by $u(\rho)$ we refer to
the value of the end-state of $\rho$, that is, $u(\rho) = u(s[\rho])$.
While the empty action sequence is an $s$-plan for any state $s$,
the objective in oversubscription planning is to find an $s_0$-plan
that achieves as valuable a state as possible, and optimal
oversubscription planning is devoted to searching for optimal
$s_0$-plans only: An $s$-plan $\rho$ is optimal if $u(\rho)$ is
maximal among all the $s$-plans, and if $\rho$ is optimal, then

$h^*(s) \stackrel{\text{def}}{=} u(\rho)$.
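
To make the definition of $h^*$ concrete, the following Python sketch computes it by exhaustive search on a small explicit state space. It is only an illustration under stated assumptions: the function name `optimal_osp_value`, the successor-function interface, and the five-state graph are hypothetical (only the two transitions $s_1 \to s_3 \to s_5$ mirror the plan used in the running example of this paper), and the procedure is a plain label-correcting search, not the BFBB algorithm of the paper.

```python
def optimal_osp_value(s0, successors, value, budget):
    """h*(s0): the best end-state value over all action sequences of
    total cost <= budget.  successors(s) yields (cost, next_state)."""
    best = value(s0)            # the empty plan is always an s-plan
    cheapest = {s0: 0.0}        # cheapest known cost of reaching a state
    stack = [(s0, 0.0)]
    while stack:
        s, g = stack.pop()
        if g > cheapest.get(s, float("inf")):
            continue            # stale entry, a cheaper path was found
        for cost, t in successors(s):
            g2 = g + cost
            if g2 <= budget and g2 < cheapest.get(t, float("inf")):
                cheapest[t] = g2
                best = max(best, value(t))
                stack.append((t, g2))
    return best


# Hypothetical unit-cost graph over states 1..5.
edges = {1: [(1, 2), (1, 3)], 2: [(1, 4)], 3: [(1, 5)], 4: [(1, 5)], 5: []}
succ = lambda s: edges[s]
val = lambda s: 1 if s == 5 else 0
```

With budget 2 the state 5 is reachable via state 3, so the optimal value is 1; with budget 1 only zero-valued states are reachable.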

Each planning task $\Pi$ induces a state-transition model, or
transition graph. Following Katz and Domshlak (2010b), we
distinguish between the actual node/edge-weighted transition graphs,
and their weights-omitted, qualitative skeletons, referred to as
transition graph structures. Informally, the latter capture the
dynamics of the planning tasks, while the former associate these
dynamics with “performance measures” (Russell and Norvig 2009). A
transition graph structure (or tg-structure) is a triplet
$\mathcal{T} = \langle S, L, Tr \rangle$ where $S$ is the finite set
of states, $L$ is the finite set of labels, and
$Tr \subseteq S \times L \times S$ is a set of labeled state
transitions. Each tg-structure
$\mathcal{T} = \langle S, L, Tr \rangle$ implicitly defines a space
of performance measures that can be associated with it. In the
context of OSP, this space constitutes $C \times U \times B$ where
$C$ is the set of all functions from labels $L$ to $\mathbb{R}^{0+}$,
$U$ is the set of all functions from states $S$ to $\mathbb{R}^{0+}$,
and $B = \mathbb{R}^{0+}$. A transition graph (or t-graph)
$\Phi = \langle \mathcal{T}, c, u, b \rangle$ associates a
tg-structure $\mathcal{T}$ with a specific performance measure
$(c, u, b) \in C \times U \times B$. A path $\pi$ from state $s$
along the transitions of $\mathcal{T}$ is an $s$-plan for $\Phi$ if
$\sum_{\langle s', l, s'' \rangle \in \pi} c(l) \le b$.

The tg-structure $\mathcal{T}(\Pi)$ induced by a planning task
$\Pi = \langle V, A; s_0, c, u, b \rangle$ is induced by the
structure $\langle V, A \rangle$ of the latter: the states and labels
of $\mathcal{T}(\Pi)$ are states $S = dom(V)$ and actions $A$ of
$\Pi$, respectively, and $\langle s, a, s[a] \rangle \in Tr$ iff
action $a$ is applicable in state $s$. The t-graph induced by a
planning task $\Pi = \langle V, A; s_0, c, u, b \rangle$ is
$\Phi(\Pi) = \langle \mathcal{T}(\Pi), c, u, b \rangle$. Since there
is an obvious correspondence between the $s$-plans for $\Pi$ and the
$s$-plans for $\Phi(\Pi)$, searching in $\Phi(\Pi)$ corresponds to
planning for $\Pi$ via state-space search, and heuristic-search
procedures of this kind employ heuristic functions to estimate the
relative attractiveness of various parts of the t-graph $\Phi(\Pi)$.
A useful heuristic function must be both efficiently computable from
the planning task and relatively accurate in its estimates.
Improving the accuracy of a heuristic function without substantially
worsening the time complexity of computing it translates into faster
search for plans.

In classical planning, numerous approximation techniques, such as
monotonic relaxation, critical trees, logical landmarks, and
abstractions, have been translated to extremely useful heuristic
functions, and different heuristics for classical planning can also
be combined into their point-wise maximizing and/or additive
ensembles.1 Unfortunately, while some of these ideas have also been
translated to classical planning with numeric state variables, the
resulting heuristics do not appear useful for OSP. Approaching the
need for effective heuristics for OSP, here we focus on abstractions
for OSP, from their very definition and properties, to the prospects
of deriving (admissible) abstraction heuristics.

Abstractions  for  OSP 

The term “abstraction” is usually associated with simplifying
the original system, factoring out details less crucial in
the  given  context.  In  classical  planning,  the  abstract  t-graphs 
are  required  not  to  increase  the  distances  between  the  (ab¬ 
stracted)  states  (Katz  and  Domshlak  2010b),  and  such  “dis¬ 
tance  conservation”  is  in  particular  guaranteed  by  homomor¬ 
phic  abstractions,  obtained  by  systematically  contracting 
sets  of  states  into  single  abstract  states  (Helmert,  Haslum, 
and  Hoffmann  2007).  In  turn,  an  additive  abstraction  in  clas¬ 
sical  planning  is  a  set  of  abstractions,  inter-constrained  to 
jointly  not  overestimate  the  state-to-state  costs  of  the  original 
task.  As  we  now  show,  the  concept  of  (additive)  abstractions 
in  OSP  is  very  different,  and,  for  better  and  for  worse,  has 
many  more  degrees  of  freedom  than  the  respective  concept 
in  classical  planning. 

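For $k \in \mathbb{N}^+$, by $[k]$ we denote the set
$\{1, 2, \ldots, k\}$. Let $\mathcal{T} = \langle S, L, Tr \rangle$
be a tg-structure, and let
$\mathcal{T}_i = \langle S_i, L_i, Tr_i \rangle$, $i \in [k]$, be a
set of some tg-structures, each related to $\mathcal{T}$ via some
state mapping $\alpha_i : S \to S_i$. Such a set of
tg-structure/state-mapping pairs
$\mathcal{AS} = \{(\mathcal{T}_i, \alpha_i)\}_{i \in [k]}$ is what is
called an abstraction skeleton for $\mathcal{T}$ (Katz and Domshlak
2010b). Now, if $C_i \times U_i \times B_i$ is the performance
measure space of $\mathcal{T}_i$, then
$\mathbf{C} \times \mathbf{U} \times \mathbf{B}$, with
$\mathbf{C} = \times_i C_i$, $\mathbf{U} = \times_i U_i$, and
$\mathbf{B} = \times_i B_i$, is the joint performance measure space
of $\mathcal{AS}$. That is, any choice of
$(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathbf{C} \times \mathbf{U} \times \mathbf{B}$
induces a set of t-graphs
$\{\langle \mathcal{T}_i, \mathbf{c}[i], \mathbf{u}[i], \mathbf{b}[i] \rangle\}_{i \in [k]}$.
In turn, once $\mathcal{T}$ is associated with a performance measure
$(c, u, b)$, each joint performance measure
$(\mathbf{c}, \mathbf{u}, \mathbf{b})$ of $\mathcal{AS}$ either does
or does not constitute an (additive) abstraction of the t-graph
$\Phi = \langle \mathcal{T}, c, u, b \rangle$. In Definition 1 we
capture this relation at an even more refined level, with respect to
a specific state of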


1 For a comparative survey and pointers to the literature, we refer
the reader to Helmert and Domshlak (2009).


Figure 1: Illustration for our running example

interest in $\Phi$, and we do that directly in terms of t-graphs
induced by OSP tasks.

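Definition 1 (Additive Abstraction)

Let $\Pi = \langle V, A; s, c, u, b \rangle$ be an OSP task,
$\mathcal{AS} = \{(\mathcal{T}_i, \alpha_i)\}_{i \in [k]}$ be an
abstraction skeleton for $\mathcal{T}(\Pi)$, and
$(\mathbf{c}, \mathbf{u}, \mathbf{b})$ be a joint performance measure
for $\mathcal{AS}$. The set of t-graphs
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})} = \{\langle \mathcal{T}_i, \mathbf{c}[i], \mathbf{u}[i], \mathbf{b}[i] \rangle\}_{i \in [k]}$
is an (additive) abstraction for $\Pi$, denoted by
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})} \in_s \mathcal{AS}$, if

$h^*(s) \le h^{\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s) \stackrel{\text{def}}{=} \sum_{i \in [k]} h_i^*(\alpha_i(s))$,

i.e., when $h^{\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s)$
is an admissible estimate of $h^*(s)$.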

In simple terms, a set of abstractions in OSP is constrained to
jointly not underestimate the value that can be obtained from a
concrete state of the original task within a given cost budget. For
example, let
$\mathcal{T} = \langle \{s_i\}_{i \in [5]}, \{l_i\}_{i \in [5]}, Tr \rangle$
in Figure 1a be a tg-structure of some OSP task $\Pi$ with initial
state $s_1$, and
$\mathcal{AS} = \{(\mathcal{T}_1, \alpha_1), (\mathcal{T}_2, \alpha_2)\}$,
with tg-structures $\mathcal{T}_1$, $\mathcal{T}_2$ as in Figure 1b
and state mappings

sl,  i  £  {2, 4}  ,  -.  _  ( s§,  i  =  3 

I  1  •  (Si  )  —  v  2  1 

Si ,  otherwise  I  Si ,  otherwise. 

Let t-graphs $\Phi(\Pi) = \langle \mathcal{T}(\Pi), c, u, b \rangle$,
$\Phi_1 = \langle \mathcal{T}_1, c_1, u_1, b_1 \rangle$,
$\Phi_2 = \langle \mathcal{T}_2, c_2, u_2, b_2 \rangle$ be defined
via label cost functions $c, c_1, c_2$ that associate all labels with
a cost of 1, budgets $b = b_1 = b_2 = 2$, and state value functions
$u, u_1, u_2$ that evaluate to zero on all states except for
$s_5, s_5^1, s_5^2$, on which they respectively evaluate to one.
Considering the state $s_1$ of $\Pi$, the optimal $s_1$-plan for
$\Pi$ is
$\pi = \langle (s_1, l_2, s_3), (s_3, l_4, s_5) \rangle$ with
$u(\pi) = 1$. The optimal $\alpha_1(s_1)$-plan for $\Phi_1$ is
$\pi_1 = \langle (s_1^1, l_1, s_5^1) \rangle$ with
$u_1(\pi_1) = 1$, and the optimal $\alpha_2(s_1)$-plan for $\Phi_2$
is $\pi_2 = \langle (s_1^2, l_2, s_5^2) \rangle$, with
$u_2(\pi_2) = 1$. Since $u(\pi) \le u_1(\pi_1) + u_2(\pi_2)$,
$\mathcal{A} = \{\Phi_1, \Phi_2\}$ is an additive abstraction for
$\Pi$.

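Theorem 1 For any OSP task
$\Pi = \langle V, A; s, c, u, b \rangle$, any abstraction skeleton
$\mathcal{AS}$ of $\mathcal{T}(\Pi)$, and any
$\mathcal{A} \in_s \mathcal{AS}$, if the t-graphs of $\mathcal{A}$
are given explicitly, then $h^{\mathcal{A}}(s)$ can be computed in
time polynomial in $||\Pi||$ and $||\mathcal{A}||$.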

Figure 2: Fragments of restricted optimization over $\mathcal{A}(s)$.

The proof is straightforward: Let
$\mathcal{A} = \{\Phi_i\}_{i \in [k]}$, with
$\Phi_i = \langle \mathcal{T}_i, c_i, u_i, b_i \rangle$, be an
additive abstraction for $\Pi$. For $i \in [k]$, let
$S_i' = \{s' \in S_i \mid c_i(\alpha_i(s), s') \le b_i\}$, where
$c_i(\alpha_i(s), s')$ is the cost of a cheapest path from
$\alpha_i(s)$ to $s'$. Since $\mathcal{A}$ is given explicitly,
computing shortest paths from $\alpha_i(s)$ to all states in
$\mathcal{T}_i$, and thus computing $S_i'$, can be done in time
polynomial in $||\mathcal{A}||$ for all $i \in [k]$. If $\pi_i$ is an
optimal $\alpha_i(s)$-plan for $\Phi_i$, then by Definition 1,
$u_i(\pi_i) = \max_{s' \in S_i'} u_i(s')$, and thus computing
$h^{\mathcal{A}}(s) = \sum_{i \in [k]} u_i(\pi_i)$ is polynomial time
in $||\mathcal{A}||$.
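
The argument behind Theorem 1 can be sketched in Python as follows: for each explicitly given abstract t-graph, run Dijkstra's algorithm from the abstract image of the state, keep the states reachable within the abstract budget, and sum the best reachable values. The adjacency-list encoding, the tuple layout `(graph, alpha, value, budget)`, and the function names are assumptions made for this illustration only.

```python
import heapq


def reachable_within(graph, source, budget):
    """States whose cheapest-path cost from source is <= budget.
    graph maps a state to a list of (cost, next_state) pairs."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist[s]:
            continue  # stale heap entry
        for cost, t in graph.get(s, ()):
            nd = d + cost
            if nd <= budget and nd < dist.get(t, float("inf")):
                dist[t] = nd
                heapq.heappush(heap, (nd, t))
    return set(dist)


def additive_estimate(abstractions, s):
    """Sum over abstractions of the best value reachable within budget.
    Each abstraction is (graph, alpha, value, budget), with alpha a
    dict from concrete states to abstract states."""
    total = 0.0
    for graph, alpha, value, budget in abstractions:
        states = reachable_within(graph, alpha[s], budget)
        total += max(value.get(t, 0.0) for t in states)
    return total


# Hypothetical abstractions with unit costs and budget 2 each.
abs1 = ({"a1": [(1.0, "a5")]}, {"s1": "a1"}, {"a5": 1.0}, 2.0)
abs2 = ({"b1": [(1.0, "b5")]}, {"s1": "b1"}, {"b5": 1.0}, 2.0)
estimate = additive_estimate([abs1, abs2], "s1")
```

Each abstraction contributes 1, so the estimate is 2, matching the sum $u_1(\pi_1) + u_2(\pi_2)$ of the running example.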

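While Theorem 1 is positive, it establishes only a necessary
condition for the relevance of OSP abstractions to practice. Given
an OSP task $\Pi$, and having fixed an abstraction skeleton
$\mathcal{AS}$ with a joint performance measure space
$\mathbf{C} \times \mathbf{U} \times \mathbf{B}$, for each state of
interest $s$, we should be able to automatically identify an
abstraction that provides us with as accurate (that is, as low) an
estimate as possible. Let
$\mathcal{A}(s) \subseteq \mathbf{C} \times \mathbf{U} \times \mathbf{B}$
be the subset of joint performance measures that constitute
abstractions for $\Pi$. Note that $\mathcal{A}(s)$ is not a
combinatorial rectangle in
$\mathbf{C} \times \mathbf{U} \times \mathbf{B}$. For instance,
consider t-graph $\Phi(\Pi)$, state $s_1$ of $\Phi(\Pi)$, and
abstraction skeleton $\mathcal{AS}$ from our running example. Let
$\mathbf{c} \in \mathbf{C}$ be a cost function vector with both
$\mathbf{c}[1]$ and $\mathbf{c}[2]$ being constant, unit-cost
functions, and two performance measures
$(\mathbf{c}, \mathbf{u}, \mathbf{b}), (\mathbf{c}, \mathbf{u}', \mathbf{b}') \in \mathbf{C} \times \mathbf{U} \times \mathbf{B}$
being defined via budget vectors
$\mathbf{b} = \{\mathbf{b}[1] = 2, \mathbf{b}[2] = 0\}$ and
$\mathbf{b}' = \{\mathbf{b}'[1] = 0, \mathbf{b}'[2] = 2\}$, and value
function vectors $\mathbf{u}$ and $\mathbf{u}'$, with
$\mathbf{u}[1]$, $\mathbf{u}[2]$, $\mathbf{u}'[1]$, and
$\mathbf{u}'[2]$ evaluating to zero on all states except for
$\mathbf{u}[1](s_5^1) = \mathbf{u}'[2](s_5^2) = 1$. It is easy to
verify that
$(\mathbf{c}, \mathbf{u}, \mathbf{b}), (\mathbf{c}, \mathbf{u}', \mathbf{b}') \in \mathcal{A}(s_1)$,
yet
$(\mathbf{c}, \mathbf{u}', \mathbf{b}), (\mathbf{c}, \mathbf{u}, \mathbf{b}') \notin \mathcal{A}(s_1)$.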

We now proceed with considering a specific family of additive
abstractions, reveal some of its interesting properties, and show
that it contains substantial islands of tractability. We break down
and approach the overall agenda of complexity analysis of
abstraction-based heuristic functions under fixation of some of the
three dimensions of $\mathcal{A}(s)$: If, for instance, we are given
a vector of value functions $\mathbf{u}$ that is known to belong to
the projection of $\mathcal{A}(s)$ on $\mathbf{U}$, then we can
search for a quality abstraction from the abstraction subset
$\mathcal{H}_{(-,\mathbf{u},-)}(s) \subseteq \mathcal{A}(s)$,
corresponding to the projection of $\mathcal{A}(s)$ on
$\{\mathbf{u}\}$. As we show below, even some constrained estimate
optimizations of this kind can be challenging. The lattice in
Figure 2 depicts the range of options for such constrained
optimization; at the extreme settings, $\mathcal{H}_{(-,-,-)}(s)$ is
simply a renaming of $\mathcal{A}(s)$, and
$\mathcal{H}_{(\mathbf{c},\mathbf{u},\mathbf{b})}(s)$ corresponds to
a single abstraction
$(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{A}(s)$.

Partitions  and  Homomorphic  Abstractions 

With Definition 1 allowing for very general abstraction skeletons,
in this work we focus on homomorphic abstraction skeletons.2 Given a
tg-structure $\mathcal{T} = \langle S, L, Tr \rangle$, an


2 All the results also hold verbatim for the more general “labeled
paths preserving” abstraction skeletons studied by Katz and
Domshlak (2010b) in the context of optimal classical planning.


Figure 3: Homomorphic abstraction skeleton for $\mathcal{T}(\Pi)$ in
Figure 1.

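abstraction skeleton
$\mathcal{AS} = \{(\mathcal{T}_i, \alpha_i)\}_{i \in [k]}$ of
$\mathcal{T}$ is homomorphic if, for $i \in [k]$, $L_i = L$, and
$\langle s, l, s' \rangle \in Tr$ only if
$\langle \alpha_i(s), l, \alpha_i(s') \rangle \in Tr_i$. In our
running example, the abstraction skeleton depicted in Figure 1b is
not homomorphic, but its slight extension as in Figure 3, ceteris
paribus, is homomorphic. Furthermore, we focus on a fragment of
additive abstractions

$\mathcal{A}_p(s) = \mathcal{A}(s) \cap [\mathbf{C}_p \times \mathbf{U}_p \times \mathbf{B}_p]$,

where $\mathbf{C}_p \subseteq \mathbf{C}$,
$\mathbf{U}_p \subseteq \mathbf{U}$, and
$\mathbf{B}_p \subseteq \mathbf{B}$ correspond to cost, value, and
budget partitions, respectively. In what follows, by
$\mathcal{H}^p_X$ we refer to $\mathcal{H}_X \cap \mathcal{A}_p(s)$;
e.g.,
$\mathcal{H}^p_{(-,\mathbf{u},-)}(s) = \mathcal{H}_{(-,\mathbf{u},-)}(s) \cap \mathcal{A}_p(s)$.
Given a t-graph $\Phi = \langle \mathcal{T}, c, u, b \rangle$, a
homomorphic abstraction skeleton
$\mathcal{AS} = \{(\mathcal{T}_i, \alpha_i)\}_{i \in [k]}$ of
$\mathcal{T}$, and $\mathbf{c} \in \mathbf{C}$, we have
$\mathbf{c} \in \mathbf{C}_p$ iff, for each label $l$ in
$\mathcal{T}$, $\sum_{i \in [k]} \mathbf{c}[i](l) \le c(l)$.
Similarly, $\mathbf{b} \in \mathbf{B}_p$ iff
$\sum_{i \in [k]} \mathbf{b}[i] \le b$, and (note the change in the
direction of the inequality) $\mathbf{u} \in \mathbf{U}_p$ iff, for
each state $s$ in $\mathcal{T}$,

$\sum_{i \in [k]} \mathbf{u}[i](\alpha_i(s)) \ge u(s)$.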
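
The three partition conditions are simple to check on explicit data. The following Python sketch assumes dictionary encodings of cost, value, and mapping functions; the function names and the small numerical tolerance `eps` are illustrative choices, not part of the definitions.

```python
def is_cost_partition(c_parts, c, labels, eps=1e-9):
    """For each label l: sum_i c[i](l) <= c(l)."""
    return all(sum(ci[l] for ci in c_parts) <= c[l] + eps for l in labels)


def is_budget_partition(b_parts, b, eps=1e-9):
    """sum_i b[i] <= b."""
    return sum(b_parts) <= b + eps


def is_value_partition(u_parts, alphas, u, states, eps=1e-9):
    """For each state s: sum_i u[i](alpha_i(s)) >= u(s).
    Note the reversed inequality compared with costs and budgets."""
    return all(
        sum(ui[alpha[s]] for ui, alpha in zip(u_parts, alphas)) >= u[s] - eps
        for s in states
    )


# Hypothetical two-abstraction instance.
labels = ["l1", "l2"]
c = {"l1": 1.0, "l2": 1.0}
c_parts = [{"l1": 0.5, "l2": 1.0}, {"l1": 0.5, "l2": 0.0}]
states = ["s1", "s5"]
u = {"s1": 0.0, "s5": 1.0}
alphas = [{"s1": "a1", "s5": "a5"}, {"s1": "b1", "s5": "b1"}]
u_parts = [{"a1": 0.0, "a5": 1.0}, {"b1": 0.0}]
```

Costs and budgets are split so that the parts never exceed the original, whereas values must be covered by the parts, which is what keeps the resulting estimate from falling below the true optimum.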

Theorem 2 below establishes a “completeness” relationship between
the sets $\mathbf{C}_p$ and $\mathbf{B}_p$, as well as an even
stronger “completeness” of $\mathbf{C}_p$ and $\mathbf{B}_p$. In
particular, it implies that, for all states $s$, the projections of
$\mathcal{A}_p(s)$ on $\mathbf{C}_p$, $\mathbf{U}_p$, and
$\mathbf{B}_p$ are the entire sets $\mathbf{C}_p$, $\mathbf{U}_p$,
and $\mathbf{B}_p$, respectively.

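Theorem 2 Given an OSP task
$\Pi = \langle V, A; s, c, u, b \rangle$ and a homomorphic
abstraction skeleton
$\mathcal{AS} = \{(\mathcal{T}_i, \alpha_i)\}_{i \in [k]}$ of
$\mathcal{T}(\Pi)$,

(1) for each action cost partition $\mathbf{c} \in \mathbf{C}_p$,
there exists a budget partition $\mathbf{b} \in \mathbf{B}_p$ such
that
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})} \in_s \mathcal{AS}$
for all $\mathbf{u} \in \mathbf{U}_p$.

(2) for each budget partition $\mathbf{b} \in \mathbf{B}_p$, there
exists an action cost partition $\mathbf{c} \in \mathbf{C}_p$ such
that
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})} \in_s \mathcal{AS}$
for all $\mathbf{u} \in \mathbf{U}_p$.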

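Proof: Let
$\pi = \langle (s, a_1, s_1), (s_1, a_2, s_2), \ldots, (s_{n-1}, a_n, s_n) \rangle$
be an optimal $s$-plan in $\Phi(\Pi)$. Given that $\mathcal{AS}$ is
homomorphic, let $\pi_1, \ldots, \pi_k$ be the projections of $\pi$
on $\mathcal{T}_1, \ldots, \mathcal{T}_k$, respectively, that is,
for $i \in [k]$,

$\pi_i = \langle (\alpha_i(s), a_1, \alpha_i(s_1)), \ldots, (\alpha_i(s_{n-1}), a_n, \alpha_i(s_n)) \rangle$.

(1) Let budget profile $\mathbf{b}^* \in \mathbf{B}$ be defined as
$\mathbf{b}^*[i] = \sum_{j \in [n]} \mathbf{c}[i](a_j)$, for
$i \in [k]$. First, note that $\mathbf{b}^* \in \mathbf{B}_p$ since

$\sum_{i \in [k]} \mathbf{b}^*[i] = \sum_{i \in [k]} \sum_{j \in [n]} \mathbf{c}[i](a_j) \overset{(*)}{\le} \sum_{j \in [n]} c(a_j) \overset{(**)}{\le} b$,

where (*) is by $\mathbf{c}$ being an action cost partition, and
(**) is by $\pi$ being an $s$-plan for $\Pi$. Second, for any
$\mathbf{u} \in \mathbf{U}$, by the construction of $\mathbf{b}^*$,
$\pi_i$ is an $\alpha_i(s)$-plan for the t-graph
$\langle \mathcal{T}_i, \mathbf{c}[i], \mathbf{u}[i], \mathbf{b}^*[i] \rangle$.
Now, let $\mathbf{u} \in \mathbf{U}_p$, and for $i \in [k]$, let
$\pi_i^*$ be an optimal $\alpha_i(s)$-plan for that t-graph
$\langle \mathcal{T}_i, \mathbf{c}[i], \mathbf{u}[i], \mathbf{b}^*[i] \rangle$.
We have

$\sum_{i \in [k]} \mathbf{u}[i](\pi_i^*) \overset{(*)}{\ge} \sum_{i \in [k]} \mathbf{u}[i](\pi_i) \overset{(**)}{\ge} u(\pi)$,   (1)

where (*) is by optimality of $\pi_i^*$, and (**) is by $\pi_i$
being the projection of $\pi$ and $\mathbf{u} \in \mathbf{U}_p$.
Therefore, $(\mathbf{c}, \mathbf{u}, \mathbf{b}^*)$ induces an
additive abstraction for $\Pi$, that is,
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)} \in_s \mathcal{AS}$.

(2) Let cost function profile $\mathbf{c}^* \in \mathbf{C}$ be
defined as $\mathbf{c}^*[i](a) = c(a) \cdot \frac{\mathbf{b}[i]}{b}$
for all actions $a \in A$, and all $i \in [k]$. First, we have
$\mathbf{c}^* \in \mathbf{C}_p$ since
$\mathbf{b} \in \mathbf{B}_p$ implies
$\frac{1}{b} \sum_{i \in [k]} \mathbf{b}[i] \in [0, 1]$. Second, for
any $\mathbf{u} \in \mathbf{U}$, by our construction of
$\mathbf{c}^*$, $\pi_i$ is an $\alpha_i(s)$-plan for the t-graph
$\langle \mathcal{T}_i, \mathbf{c}^*[i], \mathbf{u}[i], \mathbf{b}[i] \rangle$.
Following now exactly the same line of reasoning as the one around
Eq. 1 above accomplishes the proof that
$\mathcal{A}_{(\mathbf{c}^*,\mathbf{u},\mathbf{b})} \in_s \mathcal{AS}$
for any $\mathbf{u} \in \mathbf{U}_p$. □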
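
The two constructions used in the proof can be written down directly. The Python sketch below is illustrative only: the function names and dictionary encodings are assumptions, and the first construction takes an optimal plan as input, which the proof uses purely as an existence argument (no efficient way of obtaining such a plan is implied). The second construction assumes a strictly positive budget $b$.

```python
def budget_from_cost_partition(c_parts, plan):
    """Part (1): b*[i] = sum_j c[i](a_j) over the actions a_j of a
    given (optimal) plan of the original task."""
    return [sum(ci[a] for a in plan) for ci in c_parts]


def cost_from_budget_partition(c, b_parts, b):
    """Part (2): c*[i](a) = c(a) * b[i] / b for every action a.
    Assumes b > 0."""
    return [{a: cost * bi / b for a, cost in c.items()} for bi in b_parts]


# Hypothetical instance: two unit-cost actions, budget 2.
c = {"a1": 1.0, "a2": 1.0}
b = 2.0
c_parts = [{"a1": 0.5, "a2": 0.25}, {"a1": 0.5, "a2": 0.75}]
plan = ["a1", "a2"]

b_star = budget_from_cost_partition(c_parts, plan)
c_star = cost_from_budget_partition(c, [0.5, 1.5], b)
```

In this instance `b_star` sums to exactly the plan cost, and each action's cost in `c_star` is split across the two abstractions in proportion to their budgets.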

Again, an important corollary of Theorem 2 is that, for all states
$s$, the projections of $\mathcal{A}_p(s)$ on $\mathbf{C}_p$,
$\mathbf{U}_p$, and $\mathbf{B}_p$ are the entire sets
$\mathbf{C}_p$, $\mathbf{U}_p$, and $\mathbf{B}_p$, respectively. A
priori, this property should simplify the task of abstraction
optimization, and later we show that this is indeed the case.
However, complexity analysis of abstraction optimization in the most
general terms is still problematic because the OSP formalism is
parametric in the representation of value functions. Hence, as a
first step, we restrict our attention to a fragment of
$\mathcal{A}_p(s)$ in which all abstract value functions are what we
call 0-binary: A real-valued function $f$ is 0-binary if it has
image $img(f) = \{0, \sigma_f\}$ for some $\sigma_f \in \mathbb{R}$.
A set of 0-binary functions $F$ is called strong if
$\sigma_f = \sigma_{f'}$ for all $f, f' \in F$.
On  the  one  hand,  0-binary  functions  constitute  rather  a  basic 
family  of  value  functions.  Hence,  if  abstraction  optimization 
is  hard  for  them,  it  is  likely  to  be  hard  for  any  non-trivial 
family  of  abstract  value  functions.  On  the  other  hand,  0- 
binary  abstract  value  functions  seem  to  fit  well  abstractions 
of  planning  tasks  in  which  value  functions  are  linear  com¬ 
binations  of  indicators,  each  representing  achievement  of  a 
“goal  value”  for  some  state  variable. 

$\mathcal{A}_p(s)$ and 0-Binary Value Partitions

Important roles in what follows are played by the well-known
Knapsack problem, as well as some tools from convex optimization. In
a Knapsack problem $(\{w_i, \sigma_i\}_{i \in [n]}, W)$, $W$ is a
weight allowance, $[n]$ is a set of objects, and each $i \in [n]$
has a weight $w_i$ and a value $\sigma_i$. The objective is to find
a subset $X \subseteq [n]$ that maximizes
$\sum_{i \in X} \sigma_i$ over all subsets $X' \subseteq [n]$ with
$\sum_{i \in X'} w_i \le W$. By strict Knapsack we refer to a
variant of Knapsack in which that inequality constraint is strict.
Knapsack is NP-hard, but there exist pseudo-polynomial algorithms
for it that run in time polynomial in the description of the problem
and in the unary representation of $W$ (Garey and Johnson 1978). The
latter property makes solving Knapsack practical in many
applications where the ratio $W / \min_i w_i$ is reasonably low.
Likewise, if $\sigma_i = \sigma_j$ for all $i, j \in [n]$, then a
greedy algorithm solves the problem in linear time by iteratively
expanding $X$ by one of the weight-wise lightest objects in
$[n] \setminus X$, until $X$ cannot be expanded any further within
$W$.
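
Both Knapsack fragments mentioned above admit short implementations. The Python sketch below is a textbook illustration, not code from the paper: `greedy_equal_value_knapsack` handles the identical-value case (it sorts the weights for simplicity, so it runs in $O(n \log n)$ rather than linear time), and `knapsack_dp` is the standard pseudo-polynomial dynamic program, which assumes non-negative integer weights.

```python
def greedy_equal_value_knapsack(weights, capacity):
    """All items have the same value: pick the lightest items first.
    Returns the indices of the chosen items."""
    chosen, used = [], 0
    for i in sorted(range(len(weights)), key=lambda j: weights[j]):
        if used + weights[i] > capacity:
            break  # every remaining item is at least as heavy
        chosen.append(i)
        used += weights[i]
    return chosen


def knapsack_dp(weights, values, capacity):
    """Maximum total value with total weight <= capacity.
    Integer weights; runs in O(n * capacity) time."""
    best = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacities downwards so each item is used at most once.
        for cap in range(capacity, w - 1, -1):
            best[cap] = max(best[cap], best[cap - w] + v)
    return best[capacity]
```

For example, with weights `[3, 1, 2]` and capacity 3, the greedy procedure selects the items of weight 1 and 2, which is also what the dynamic program finds when all values equal 1.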


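Let $s$ be a state of an OSP task $\Pi$, $\mathcal{AS}$ be an
(explicitly given) homomorphic abstraction skeleton of
$\mathcal{T}(\Pi)$, and suppose that we fix a value partition
$\mathbf{u} \in \mathbf{U}_p$. By Theorem 2,
$\mathcal{H}^p_{(-,\mathbf{u},-)}(s)$ is not empty, and thus we can
try computing

$\min_{(\mathbf{c},\mathbf{u},\mathbf{b}) \in \mathcal{H}^p_{(-,\mathbf{u},-)}(s)} h^{\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s)$.

As of yet, however, we do not know whether this task is
polynomial-time solvable for any non-trivial class of value
partitions. In fact, although
$\mathcal{H}^p_{(-,\mathbf{u},-)}(s)$ is known to be non-empty, and
so, too, are all of its subsets
$\mathcal{H}^p_{(-,\mathbf{u},\mathbf{b})}(s)$ and
$\mathcal{H}^p_{(\mathbf{c},\mathbf{u},-)}(s)$, finding an
abstraction
$(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{H}^p_{(-,\mathbf{u},-)}(s)$
is not necessarily easy.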

In that respect, our first tractability results are for abstraction
discovery within $\mathcal{H}^p_{(-,\mathbf{u},-)}(s)$ where
$\mathbf{u}$ is a strong 0-binary value partition. The first (and
the simpler) result in Theorem 3 further assumes a fixed action cost
partition, while the next result, in Theorem 4, is on simultaneous
selection of admissible pairs of cost and budget partitions. We also
show how these results can be extended to pseudo-polynomial
algorithms for general 0-binary value partitions.

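Theorem 3 ($\mathcal{H}^p_{(\mathbf{c},\mathbf{u},-)}(s)$ & strong 0-binary $\mathbf{u}$)

Let $\Pi = \langle V, A; s, c, u, b \rangle$ be an OSP task,
$\mathcal{AS}$ be an explicit homomorphic abstraction skeleton of
$\mathcal{T}(\Pi)$, and $\mathbf{u} \in \mathbf{U}_p$ be a strong
0-binary value partition. Given a cost partition
$\mathbf{c} \in \mathbf{C}_p$, computing
$h^{\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})}}(s)$ for some
abstraction
$(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{H}^p_{(\mathbf{c},\mathbf{u},-)}(s)$
is polynomial-time in $||\Pi||$, $||\mathcal{AS}||$, and
$||\mathbf{u}||$.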

Proof: The proof is by reduction to the polynomial fragment of the
Knapsack problem corresponding to all items having identical value.
Let $\mathcal{AS} = \{(\mathcal{T}_i, \alpha_i)\}_{i \in [k]}$, and,
given that $\mathbf{u}$ is a strong value partition, let
$img(\mathbf{u}[i]) = \{0, \sigma\}$. For $i \in [k]$, let $w_i$ be
the cost of the cheapest path in $\mathcal{T}_i$ from $\alpha_i(s)$
to (one of the) states $s' \in S_i$ with
$\mathbf{u}[i](s') = \sigma$. Since $\mathcal{AS}$ is an explicit
abstraction skeleton, the set $\{w_i\}_{i \in [k]}$ can be computed
in time polynomial in $||\mathcal{AS}||$ using one of the algorithms
for the single-source shortest paths problem. Consider now a
Knapsack problem $(\{w_i, \sigma\}_{i \in [k]}, b)$, with weights
$w_i$ being as above and value $\sigma$ being identical for all
objects. Let $X \subseteq [k]$ be a solution to that (optimization)
Knapsack problem; recall that it is computable in polynomial time.
Given that, we define budget profile $\mathbf{b}^* \in \mathbf{B}$
as follows: for $i \in [k]$, $\mathbf{b}^*[i] = w_i$ if $i \in X$,
and $\mathbf{b}^*[i] = 0$, otherwise.

What remains to be shown is that
$(\mathbf{c}, \mathbf{u}, \mathbf{b}^*)$ actually induces an
additive abstraction for $\Pi$, that is,
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)} = \{\langle \mathcal{T}_i, \mathbf{c}[i], \mathbf{u}[i], \mathbf{b}^*[i] \rangle\}_{i \in [k]} \in_s \mathcal{AS}$.
Assume to the contrary that
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)} \notin_s \mathcal{AS}$,
and let $\pi$ be an optimal $s$-plan for $\Pi$. By the construction
of our Knapsack problem and of $\mathbf{b}^*$, for each $i \in X$,
there is an $\alpha_i(s)$-plan $\pi_i$ in
$\langle \mathcal{T}_i, \mathbf{c}[i], \mathbf{u}[i], \mathbf{b}^*[i] \rangle$
with $u_i(\pi_i) = \sigma$. According to Definition 1, our
assumption implies that
$u(\pi) > \sum_{i \in X} u_i(\pi_i) = \sigma \cdot |X|$. On the
other hand, from Theorem 2, there exists a budget partition
$\mathbf{b} \in \mathbf{B}_p$ such that
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b})} \in_s \mathcal{AS}$.
This budget partition induces a feasible solution
$X' = \{i \mid w_i \le \mathbf{b}[i]\}$ for our Knapsack problem for
which $u(\pi) \le \sum_{i \in X'} \sigma = \sigma \cdot |X'|$. This,
however, implies $|X| < |X'|$, contradicting our assumption, and
thus accomplishing the proof of
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)} \in_s \mathcal{AS}$. □
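
The construction in the proof of Theorem 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the prototype described later in the paper: abstract t-graphs are adjacency lists whose edge costs are assumed to already reflect the fixed cost partition $\mathbf{c}[i]$, state mappings are dictionaries, and the names `cheapest_cost_to_value` and `select_budgets` are hypothetical.

```python
import heapq


def cheapest_cost_to_value(graph, source, value, sigma):
    """w_i: cost of a cheapest path from source to any state whose
    value equals sigma (infinity if no such state is reachable)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist[s]:
            continue  # stale heap entry
        if value.get(s, 0.0) == sigma:
            return d  # first valued state popped is the cheapest one
        for cost, t in graph.get(s, ()):
            nd = d + cost
            if nd < dist.get(t, float("inf")):
                dist[t] = nd
                heapq.heappush(heap, (nd, t))
    return float("inf")


def select_budgets(abstractions, s, sigma, budget):
    """Greedy equal-value Knapsack over the weights w_i.
    Returns the budget profile b* and the estimate sigma * |X|."""
    weights = [cheapest_cost_to_value(g, alpha[s], u, sigma)
               for g, alpha, u in abstractions]
    b_star = [0.0] * len(weights)
    used, chosen = 0.0, 0
    for i in sorted(range(len(weights)), key=lambda j: weights[j]):
        if used + weights[i] > budget:
            break
        b_star[i] = weights[i]
        used += weights[i]
        chosen += 1
    return b_star, sigma * chosen


# Hypothetical skeleton with three abstractions and weights 1, 2, 1.
a1 = ({"x0": [(1.0, "x1")]}, {"s": "x0"}, {"x1": 1.0})
a2 = ({"y0": [(1.0, "y1")], "y1": [(1.0, "y2")]}, {"s": "y0"}, {"y2": 1.0})
a3 = ({"z0": [(1.0, "z1")]}, {"s": "z0"}, {"z1": 1.0})
b_star, estimate = select_budgets([a1, a2, a3], "s", 1.0, 2.0)
```

With a budget of 2, the two abstractions of weight 1 receive their full weight as budget and the third receives none, giving an estimate of $2\sigma$.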

Table 1: Expanded node statistics for optimal OSP with
BFBB search on a set of IPC tasks, cast as OSP.

The construction in the proof of Theorem 3 may appear
somewhat counterintuitive: while we are interested in
minimizing the heuristic estimate of $h^*(s)$, the abstraction
$\mathcal{A}_{(\mathbf{c},\mathbf{u},\mathbf{b}^*)}$ is selected via the value-maximizing Knapsack
problem.  However,  a  correct  view  of  the  situation  would  be 
that  the  selected  triplet  (c,  u,  b* )  provides  us  with  the  lowest 
estimate  of  h*  ( s )  among  all  abstractions  complying  with  u 
and  c,  whose  A(s)  membership,  aka  admissibility,  we  know 
how to prove in polynomial time. Finally, while strong 0-binary
value partitions are rather restrictive, finding an element
of $\mathcal{H}_{(\mathbf{c},\mathbf{u},\cdot)}(s)$ for general 0-binary $\mathbf{u}$ is no longer
polynomial: a reduction from Knapsack is straightforward.
However, Knapsack is solvable in pseudo-polynomial time,
and plugging that Knapsack algorithm into the proof of Theorem 3
results in a search algorithm for $\mathcal{H}_{(\mathbf{c},\mathbf{u},\cdot)}(s)$ with
general 0-binary $\mathbf{u}$, running in time polynomial (also) in the
unary representation of the budget $b$.

For a first test of the value that additive abstractions can
bring to heuristic-search OSP, we have prototyped a simple
planning system on the basis of Pyperplan, a lightweight
planner written in Python (see footnote 3). Within that prototype, we have
provided support for some basic pattern-database abstraction
skeletons, action cost partitions, and abstraction selection
in $\mathcal{H}_{(\mathbf{c},\mathbf{u},\cdot)}(s)$ for strong 0-binary value partitions as in
the proof of Theorem 3. As best-first forward search algorithms
such as A* are not suitable for optimal OSP, we have
implemented a best-first branch-and-bound (BFBB) search.
This BFBB expands the nodes in the decreasing order of
their state values, with the ties being broken towards higher
$h$-values, and then higher remaining budgets. As our heuristic
estimates always upper-bound the true values achievable
from states, if the $h$-value of a generated state is lower than
the best state value encountered so far, then that generated
state is pruned. The search terminates when the search frontier
becomes empty, and the optimal plan is then extracted
from the search node associated with the best-value state
encountered so far.

3 https://bitbucket.org/malte/pyperplan
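The BFBB loop described above can be sketched as follows. This is a minimal illustration under our own assumptions rather than the prototype's actual code: the function name `bfbb` and its callback parameters (`successors`, `value`, `heuristic`) are hypothetical, states are assumed hashable, and a simple duplicate check (keep a state only when it is reached with more remaining budget than before) is added for convenience.

```python
import heapq
from itertools import count


def bfbb(initial, budget, successors, value, heuristic):
    """Best-first branch-and-bound for a budgeted value-maximization task.
    successors(state) yields (next_state, cost) pairs, value(state) is the
    state's value, and heuristic(state, remaining) upper-bounds the value
    achievable from `state` with `remaining` budget."""
    tie = count()  # unique counter: the heap never compares states or plans
    best_value, best_plan = value(initial), []
    best_remaining = {initial: budget}
    # Heap keys are negated: higher value, then higher h, then more budget.
    frontier = [(-value(initial), -heuristic(initial, budget), -budget,
                 next(tie), initial, [])]
    while frontier:
        _, _, neg_remaining, _, state, plan = heapq.heappop(frontier)
        remaining = -neg_remaining
        if remaining < best_remaining[state]:
            continue  # the same state was later reached with more budget
        for nxt, cost in successors(state):
            left = remaining - cost
            if left < 0 or left <= best_remaining.get(nxt, -1):
                continue  # budget over-consumed, or dominated duplicate
            h = heuristic(nxt, left)
            if h < best_value:
                continue  # prune: cannot beat the best value found so far
            best_remaining[nxt] = left
            new_plan = plan + [nxt]
            if value(nxt) > best_value:
                best_value, best_plan = value(nxt), new_plan
            heapq.heappush(frontier, (-value(nxt), -h, -left,
                                      next(tie), nxt, new_plan))
    return best_value, best_plan
```

In this sketch the returned plan is the sequence of states visited after the initial one; a blind baseline corresponds to a `heuristic` that always returns the total value of all goals.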

Table 1 compares BFBB node expansions with three
heuristic functions, tagged blind, basic, and $h_{\mathcal{A}}$, on a set of
IPC tasks that we cast as OSP by associating a separate
value with each goal. With all three heuristics, the $h$-value
of a node $n$ is set to 0 if the cost budget at $n$ is over-consumed.
Otherwise, blind BFBB constitutes a trivial baseline
in which $h(n)$ is simply set to the total value of all goals.
In basic BFBB, each goal is associated with its atomic (that
is, single variable) projection abstraction, and $h(n)$ is set
to the total value of goals, each of which can be individually
achieved within the respective projection abstraction
(see Theorem 1), given the entire remaining budget. Finally,
$h_{\mathcal{A}}$ is an additive abstraction heuristic that is selected from
$\mathcal{H}_{(\mathbf{c},\mathbf{u},\cdot)}(s)$ as in the proof of Theorem 3, with $\mathbf{c}$ being an ad
hoc cost partition over atomic projections of the planning
task onto goal variables, and $\mathbf{u}$ being a value partition that
associates the value of each goal (only) with the respective
atomic projection.

Each task was approached under three different budgets,
which were 60%, 80%, and 100% of the minimal cost
needed to achieve all the goals in the task. Despite the simplicity
of the abstraction skeletons, the number of nodes expanded
by BFBB with $h_{\mathcal{A}}$ is typically substantially lower
than the number of nodes expanded by basic BFBB, with the
difference sometimes reaching three orders of magnitude (see footnote 4).
While this evaluation is still very preliminary, it testifies to
the practical prospects of additive abstractions for OSP.

Returning now to the algorithmic analysis in the context
of strong 0-binary value partitions, we now proceed with relaxing
the constraint of sticking to a fixed action cost partition
$\mathbf{c}$, thus buying more flexibility in selecting abstractions
from $\mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$ (and improving the accuracy of our estimates),
while still remaining computationally tractable.

Definition 2 Let $\Pi = \langle V, A; s, c, u, b \rangle$ be an OSP task, $\mathcal{AS}$
be a homomorphic abstraction skeleton of $\mathcal{T}(\Pi)$, and $\mathbf{u} \in U_p$.
By $\kappa_s(\mathbf{u})$ we refer to the largest value $v \in \mathbb{R}_{0+}$ such
that, for each action cost partition $\mathbf{c} \in C_p$, there exists a
budget partition $\mathbf{b} \in B_p$ with $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$ and
$h_{(\mathbf{c},\mathbf{u},\mathbf{b})}(s) \ge v$.

Note that $\kappa_s(\mathbf{u})$ can be as low as 0 (and for us,
"low" is good), even when, for any $0 < v < \max_{s \in S} \sum_{i \in [k]} \mathbf{u}[i](s)$,
there exists some $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$
with $h_{(\mathbf{c},\mathbf{u},\mathbf{b})}(s) > v$. In particular, note
that $h(s) = \kappa_s(\mathbf{u})$ is at least as accurate as the estimate
$h_{(\mathbf{c},\mathbf{u},\cdot)}(s)$ from Theorem 3 for any fixed cost partition $\mathbf{c}$.

4 The runtime inefficiency of Python turned out to be an unfortunate
obstacle: within the allowance of 30 minutes, some instances
were solved by blind BFBB and were not solved even by basic
BFBB, and this despite an extremely simple computation of basic
$h$-values.

Theorem 4 ($\mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$ & strong 0-binary $\mathbf{u}$)

Given an OSP task $\Pi = \langle V, A; s, c, u, b \rangle$, a homomorphic
explicit abstraction skeleton $\mathcal{AS}$ of $\mathcal{T}(\Pi)$, and a strong
0-binary value partition $\mathbf{u} \in U_p$, determining $\kappa_s(\mathbf{u})$ is
polynomial-time in $\|\Pi\|$ and $\|\mathcal{AS}\|$.

Let $\mathcal{AS} = \{\mathcal{T}_i, \alpha_i\}_{i \in [k]}$, and, given that $\mathbf{u}$ is a strong 0-binary
value partition, let $\mathrm{img}(\mathbf{u}[i]) = \{0, \sigma\}$. Note that, for
each $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$, we have $h_{(\mathbf{c},\mathbf{u},\mathbf{b})}(s) = m\sigma$
for some $m \in \{0\} \cup [k]$. Our algorithm for determining
$\kappa_s(\mathbf{u})$ is depicted in Figure 4, and its high-level flow is as
simple as it gets: the for-loop of the algorithm decreasingly
iterates over all the different estimates of $h^*(s)$ that
can possibly come from the abstractions in $\mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$,
testing whether $\kappa_s(\mathbf{u})$ equals the currently examined candidate
estimate $m\sigma$. If the test (provided by the sub-routine
always-achievable) is positive, then we are done. Otherwise,
if the test fails for all $m \in [k]$, then $\kappa_s(\mathbf{u}) = 0$, which in
particular implies that no state with value greater than 0 can
be reached from $s$ in $\Pi$ with budget $b$.

The test of always-achievable for $\kappa_s(\mathbf{u}) = m\sigma$ is based
on a certain linear program $\mathcal{L}_1(m)$, the semantics of which
is captured by Lemma 1 below (see footnote 5). Informally, if $\mathcal{L}_1(m)$ is
infeasible, then no cost partition over $\mathcal{AS}$ can provide us
with the additive estimate of $m\sigma$, and this independently of
the budget allowance; this can happen only if all $\sigma$-valued
abstract states are simply unreachable from the respective
abstractions of $s$ in at least $k - m + 1$ tg-structures of $\mathcal{AS}$.
Otherwise, if $\mathcal{L}_1(m)$ is solvable, then its solution establishes
an action cost partition over $\mathcal{AS}$ that induces the most costly
achievement of the additive estimate of $m\sigma$ using $\mathcal{AS}$, with
the respective total cost being captured in the solution by a
specific LP variable $\xi$.

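Lemma 1 Given a planning task $\Pi = \langle V, A; s, c, u, b \rangle$, let
$\Pi/b' = \langle V, A; s, c, u, b' \rangle$. For all $m \in [k]$, if $x$ is a solution
of $\mathcal{L}_1(m)$, then

$$x[\xi] \;=\; \max_{\mathbf{c} \in C_p} \;\; \min_{\substack{b' \in B \\ (\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s) \text{ w.r.t. } \Pi/b' \\ h_{(\mathbf{c},\mathbf{u},\mathbf{b})} \ge m\sigma}} b'.$$

Otherwise, if $\mathcal{L}_1(m)$ is infeasible, then for no $\Pi/b'$ there
exists $(\mathbf{c}, \mathbf{u}, \mathbf{b}) \in \mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$ with respect to $\Pi/b'$ such
that $h_{(\mathbf{c},\mathbf{u},\mathbf{b})} \ge m\sigma$.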

The correctness of the algorithm with respect to Theorem 4
stems from Lemma 1: Suppose that the algorithm terminates
within the loop, and returns $m\sigma$ for some $m > 0$.
By the construction of the algorithm, $\mathcal{L}_1(m)$ is feasible and
if $x$ is a solution of $\mathcal{L}_1(m)$, then $x[\xi] \le b$. Lemma 1 then
implies that, for each action cost partition $\mathbf{c} \in C_p$, there
exists a budget partition $\mathbf{b} \in B_p$ such that $(\mathbf{c}, \mathbf{u}, \mathbf{b})$ is an
additive abstraction for $\Pi$ and $h_{(\mathbf{c},\mathbf{u},\mathbf{b})}(s) \ge m\sigma$. If $m = k$,
then trivially $\kappa_s(\mathbf{u}) = m\sigma$. Otherwise, if $m < k$, we know
that the algorithm did not terminate at the previous iteration
corresponding to $m + 1$. Again, Lemma 1 implies that there
exists an action cost partition $\mathbf{c} \in C_p$ for which no budget
partition $\mathbf{b}$ of $b$ will induce an additive abstraction for $\Pi$
with $h_{(\mathbf{c},\mathbf{u},\mathbf{b})}(s) \ge (m + 1)\sigma$ (or, in case of infeasibility of
$\mathcal{L}_1(m + 1)$, such a budget partition exists for no action cost
partition at all). Hence, by Definition 2, $\kappa_s(\mathbf{u}) < (m + 1)\sigma$,
and in turn, by the structure of $\mathbf{u}$, that implies $\kappa_s(\mathbf{u}) = m\sigma$.
Finally, if the algorithm terminates after the loop and returns
0, then precisely the same argument on the basis of Lemma 1
implies $\kappa_s(\mathbf{u}) = 0$.

5 The proof of Lemma 1 is omitted for lack of space.

input: $\Pi = \langle V, A; s, c, u, b \rangle$, $\mathcal{AS} = \{\mathcal{T}_i, \alpha_i\}_{i \in [k]}$,
       strong 0-binary value partition $\mathbf{u} \in U_p$
output: $\kappa_s(\mathbf{u})$

for $m = k$ downto 1 do
    if always-achievable($m$) then return $m\sigma$
return 0

always-achievable($m$):
    solve($\mathcal{L}_1(m)$) $\mapsto$ infeasible / solution $x \in \mathrm{dom}(\mathcal{X})$
    if infeasible or $x[\xi] > b$ then return false
    else return true

solve($\mathcal{L}_1(m)$):
    set (5') to an arbitrary subset of constraints (5)
    loop
        set $\mathcal{L}'_1(m)$ to $\mathcal{L}_1(m)$, with constraints (5') instead of (5)
        ellipsoid-method($\mathcal{L}'_1(m)$) $\mapsto$ infeasible / solution $x \in \mathrm{dom}(\mathcal{X})$
        if infeasible return infeasible
        let $r$ be a permutation of $[k]$ such that
            $x[\mathbf{b}[r(1)]] \le x[\mathbf{b}[r(2)]] \le \dots \le x[\mathbf{b}[r(k)]]$
        if $x[\xi] \le \sum_{i \in [m]} x[\mathbf{b}[r(i)]]$ then return $x$
        extend (5') with constraint $\xi \le \sum_{i \in [m]} \mathbf{b}[r(i)]$

Figure 4: An algorithm for computing $\kappa_s(\mathbf{u})$ for strong 0-binary
value partitions $\mathbf{u} \in U_p$ (Theorem 4).

It remains now to specify our linear programs $\mathcal{L}_1(m)$ in
detail, and analyze the complexity of solving them. Each
such linear program is defined over variables

$$\mathcal{X} = \{\xi\} \cup \bigcup_{i \in [k]} \Big[ \{d(s')\}_{s' \in \mathcal{T}_i} \cup \{\mathbf{b}[i]\} \cup \bigcup_{a \in A} \{\mathbf{c}[i](a)\} \Big], \qquad (2)$$

constraints as in Eqs. 3-5, and the objective of maximizing
the value of $\xi$. The roles of the variables in $\mathcal{L}_1(m)$ are as follows.
Variable $\mathbf{c}[i](a)$ captures the cost to be associated with
label $a$ in the tg-structure $\mathcal{T}_i$. For state $s'$ in $\mathcal{T}_i$, variable $d(s')$
captures the cost of the cheapest path in $\mathcal{T}_i$ from $\alpha_i(s)$ to $s'$,
given that the transitions (aka edges) are weighted consistently
with the values of the variables $\mathbf{c}[i](\cdot)$. Variable $\mathbf{b}[i]$
captures the minimal budget needed for reaching in $\mathcal{T}_i$ a state
with value $\sigma$ from state $\alpha_i(s)$, given that, again, the transitions
are weighted consistently with the variable vector $\mathbf{c}[i]$.
Finally, $\xi$ captures the minimal total cost of reaching states
with value $\sigma$ in precisely $m$ t-graphs induced by $\mathcal{AS}$ under
the joint performance measure $(\mathbf{c}, \mathbf{u}, \mathbf{b})$.

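The constraints of $\mathcal{L}_1(m)$ are as follows. The first two
sets of constraints in (3) come from a simple LP formulation
of the single source shortest paths problem with the
source node $\alpha_i(s)$: optimizing $\sum_{i \in [k]} \sum_{s' \in \mathcal{T}_i} d(s')$ under a
fixed transition pricing $\mathbf{c}$ leads to computing precisely that.
The third set of constraints in (3) establishes the costs of the
cheapest paths in $\{\mathcal{T}_i\}$ from states $\alpha_i(s)$ to states valued $\sigma$,
enforcing the semantics of variables $\mathbf{b}[i]$. Constraints (4)
are the cost partition constraints, enforcing $\mathbf{c} \in C_p$. Finally,
constraints (5) enforce the aforementioned semantics of the
singleton variable $\xi$.

$$
\begin{aligned}
\mathcal{L}_1(m):\quad & \max\ \xi \quad \text{subject to} \\
& \forall i \in [k]: \quad
\begin{cases}
d(s') = 0, & s' = \alpha_i(s) \\
d(s') \le d(s'') + \mathbf{c}[i](a), & \forall (s'', a, s') \in \mathcal{T}_i \\
\mathbf{b}[i] \le d(s'), & \forall s' \in \mathcal{T}_i,\ \mathbf{u}[i](s') = \sigma
\end{cases} && (3) \\
& \forall a \in A: \quad \sum_{i \in [k]} \mathbf{c}[i](a) \le c(a), && (4) \\
& \forall X \subseteq [k],\ |X| = m: \quad \xi \le \sum_{i \in X} \mathbf{b}[i]. && (5)
\end{aligned}
$$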

Note that, while the number of variables, as well as the
number of constraints in (3) and (4), are polynomial in $\|\Pi\|$
and $\|\mathcal{AS}\|$, the number of constraints in (5) is $\binom{k}{m}$. Thus,
solving $\mathcal{L}_1(m)$ using standard methods for linear programming
is not practical. However, using the ellipsoid algorithm
for linear inequalities (Grötschel, Lovász, and Schrijver
1981), an LP with an exponential number of constraints
can be solved in polynomial time provided that an associated
separation problem can be solved in polynomial time.
In our case, the separation problem is, given an assignment
to the variables of $\mathcal{L}_1(m)$, to test whether it satisfies (3), (4),
and (5), and if not, to produce an inequality among (3), (4),
and (5) violated by that assignment. We now show how our
separation problem for $\mathcal{L}_1(m)$ can be solved in polynomial
time (see solve($\mathcal{L}_1(m)$) in Figure 4) using what is called
$m$-sum minimization LPs (Punnen 1992). As the number of
constraints in (3) and (4) is polynomial, their satisfaction
by an assignment $x \in \mathrm{dom}(\mathcal{X})$ can be tested directly by
substitution. For constraints (5), let $r$ be a permutation of
$[k]$ such that $x[\mathbf{b}[r(1)]] \le x[\mathbf{b}[r(2)]] \le \dots \le x[\mathbf{b}[r(k)]]$.
If $x[\xi] \le \sum_{i \in [m]} x[\mathbf{b}[r(i)]]$, then it is easy to see that $x$
satisfies all the constraints in (5), because the $m$ smallest
values give the smallest sum over any subset of size $m$. Otherwise, we have our
violated inequality $\xi \le \sum_{i \in [m]} \mathbf{b}[r(i)]$.
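The separation test for constraints (5) amounts to sorting and summing. The following sketch illustrates just that step; the function name `separate_m_sum` and its argument layout are our own assumptions, with `xi` standing for $x[\xi]$ and `b_values[i]` for $x[\mathbf{b}[i]]$.

```python
def separate_m_sum(xi, b_values, m):
    """Separation oracle for the family xi <= sum_{i in X} b[i], |X| = m.

    Returns None if the assignment satisfies every such constraint;
    otherwise returns the index set X of a violated constraint, namely
    the indices of the m smallest b-values."""
    order = sorted(range(len(b_values)), key=lambda i: b_values[i])
    smallest = order[:m]
    if xi <= sum(b_values[i] for i in smallest):
        return None  # the tightest constraint holds, hence all of them do
    return smallest
```

Only one of the $\binom{k}{m}$ constraints ever needs to be inspected, which is what keeps each call of the oracle polynomial (here $O(k \log k)$).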

From  Strong  to  General  0-Binary  Value  Partitions 

Recall that the polynomial result of Theorem 3 easily extends
to a pseudo-polynomial algorithm for general 0-binary
value partitions. It turns out that a pseudo-polynomial extension
of Theorem 4 is possible as well, though it is technically
more involved.

Theorem 5 ($\mathcal{H}_{(\cdot,\mathbf{u},\cdot)}(s)$ & 0-binary $\mathbf{u}$)

Given an OSP task $\Pi = \langle V, A; s, c, u, b \rangle$, a homomorphic
explicit abstraction skeleton $\mathcal{AS}$ of $\mathcal{T}(\Pi)$, and a 0-binary
value partition $\mathbf{u} \in U_p$, determining $\kappa_s(\mathbf{u})$ within $n$-digit
precision (see footnote 6) is polynomial-time in $\|\Pi\|$, $\|\mathcal{AS}\|$, $\log(n)$, and a
unary representation of the budget $b$ of $\Pi$.

input: $\Pi = \langle V, A; s, c, u, b \rangle$, $\mathcal{AS} = \{\mathcal{T}_i, \alpha_i\}_{i \in [k]}$,
       0-binary value partition $\mathbf{u} \in U_p$
output: $\kappa_s(\mathbf{u})$

let $0 < \epsilon < \min_{i \in [k]} \sigma_i$, $\alpha = 0$, $\beta = \sum_{i \in [k]} \sigma_i$
while $\beta - \alpha > \epsilon$ do
    $v = \alpha + (\beta - \alpha)/2$
    solve($\mathcal{L}_2(v)$) $\mapsto$ infeasible / solution $x \in \mathrm{dom}(\mathcal{X})$
    if infeasible or $x[\xi] > b$ then $\beta = v$
    else $\alpha = v$
if $\alpha = 0$ then return 0; else return $\beta$

solve($\mathcal{L}_2(v)$):
    set (6') to an arbitrary subset of constraints (6)
    loop
        set $\mathcal{L}'_2(v)$ to $\mathcal{L}_2(v)$, with constraints (6') instead of (6)
        ellipsoid-method($\mathcal{L}'_2(v)$) $\mapsto$ infeasible / solution $x \in \mathrm{dom}(\mathcal{X})$
        if infeasible return infeasible
        strict-Knapsack($\{x[\mathbf{b}[i]], \sigma_i\}_{i \in [k]}$, $x[\xi]$) $\mapsto$ solution $X \subseteq [k]$
        if $\sum_{i \in X} \sigma_i < v$ then return $x$
        extend (6') with constraint $\xi \le \sum_{i \in X} \mathbf{b}[i]$

Figure 5: An algorithm for computing $\kappa_s(\mathbf{u})$ for 0-binary
value partitions $\mathbf{u} \in U_p$ (Theorem 5).

Let $\mathcal{AS} = \{\mathcal{T}_i, \alpha_i\}_{i \in [k]}$, and, for $i \in [k]$, $\mathrm{img}(\mathbf{u}[i]) =
\{0, \sigma_i\}$. The algorithm for computing $\kappa_s(\mathbf{u})$ for inputs as in
Theorem 5 is depicted in Figure 5. The flow of that algorithm
bears some similarity to the one in Figure 4, yet it is
different in many respects.

At the high level, the algorithm performs a binary search
over the hypothesis interval $[0, \sum_{i \in [k]} \sigma_i]$. The parameter $\epsilon$
serves as the "sufficient precision" criterion for termination;
while any $\epsilon > 0$ can be used, adopting $\epsilon < \min_{i \in [k]} \sigma_i$ allows
us to provide precision-independent answers in cases
where $\kappa_s(\mathbf{u}) = 0$. At the iteration corresponding to an interval
$[\alpha, \beta]$, the algorithm attempts to solve a certain linear program
$\mathcal{L}_2(v)$, testing the hypothesis $\kappa_s(\mathbf{u}) \ge v$, where $v$ is
the midpoint of $[\alpha, \beta]$. The test is positive if $\mathcal{L}_2(v)$ is feasible
and the value $x[\xi]$ in the respective solution $x$ for $\mathcal{L}_2(v)$
indicates that, for the cost partition $\mathbf{c}$ induced by $x$, there is
a budget partition $\mathbf{b}$ that allows us to achieve the total (additive)
estimate of at least $v$ in t-graphs induced by $\mathcal{AS}$ under
the performance profile $(\mathbf{c}, \mathbf{u}, \mathbf{b})$. If so, then the next hypothesis
to test will be $\kappa_s(\mathbf{u}) \ge v'$, where $v'$ is the midpoint
of $[v, \beta]$. Otherwise, the next hypothesis corresponds to the
midpoint of $[\alpha, v]$. To ensure admissibility of the estimate,
upon termination of the loop, the estimate is set to $\beta$; the
only exception is the case of the last (unexamined) interval
being $[0, \epsilon]$, in which the estimate is safely set to 0. The correctness
of the algorithm with respect to Theorem 5 stems
from a lemma on $\mathcal{L}_2(v)$, which is identical to Lemma 1, mutatis
mutandis.

6 The statement of Theorem 5 involves the precision of the estimate
because the $\sigma_i$ values of the abstract value functions $\mathbf{u}[i]$ can
be arbitrary real numbers. In the case of integer-valued sets of functions
$\mathbf{u}$, as well as in various special cases of real-valued functions,
$\kappa_s(\mathbf{u})$ can be determined precisely using a simplification of the algorithm
we introduce to support the claim of Theorem 5. These
details, however, are more of a theoretical interest; for reasonably
small values of $\epsilon$, in practice there will be no difference between
estimates $h(s)$ and $h(s) + \epsilon$.
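The outer loop of Figure 5 is a plain bisection, which the following sketch isolates. It is only an illustration under our own assumptions: `kappa_by_bisection` is a hypothetical name, and the callback `min_cost_for(v)` is a stand-in for solve($\mathcal{L}_2(v)$), assumed to return `None` when the program is infeasible and the value $x[\xi]$ otherwise.

```python
def kappa_by_bisection(sigmas, budget, min_cost_for, eps=None):
    """Binary search over [0, sum(sigmas)] for the largest supported value.

    min_cost_for(v) plays the role of solve(L2(v)): it returns None if
    infeasible, else the cost x[xi] of achieving an estimate of at least v."""
    if eps is None:
        eps = min(sigmas) / 2.0  # any 0 < eps < min(sigmas) would do
    lo, hi = 0.0, float(sum(sigmas))
    while hi - lo > eps:
        v = lo + (hi - lo) / 2.0
        cost = min_cost_for(v)
        if cost is None or cost > budget:
            hi = v  # hypothesis "kappa >= v" rejected
        else:
            lo = v  # hypothesis "kappa >= v" accepted
    # Return the upper end to stay admissible, except when nothing was accepted.
    return 0.0 if lo == 0.0 else hi
```

The returned value over-estimates the true threshold by at most `eps`, which mirrors the precision caveat in the statement of Theorem 5.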

The LPs $\mathcal{L}_2(v)$, $v \in \mathbb{R}_{0+}$, employed for testing hypotheses
$\kappa_s(\mathbf{u}) \ge v$, are also defined over variables $\mathcal{X}$ as in Eq. 2,
and are obtained from $\mathcal{L}_1(m)$ by replacing constraints (5)
with constraints (6):

$\mathcal{L}_2(v)$: $\max \xi$ subject to constraints (3), (4), and

$$\forall X \subseteq [k] \text{ s.t. } \sum_{i \in X} \sigma_i \ge v: \quad \xi \le \sum_{i \in X} \mathbf{b}[i]. \qquad (6)$$

While the semantics of all variables but $\xi$ remains as in
$\mathcal{L}_1(m)$, $\xi$ now captures the minimal total cost of reaching
some states $\{s_i\}_{i \in [k]}$ from states $\{\alpha_i(s)\}_{i \in [k]}$ in the respective
$k$ t-graphs induced by $\mathcal{AS}$ under the performance
profile $(\mathbf{c}, \mathbf{u}, \mathbf{b})$ such that $\sum_{i \in [k]} \mathbf{u}[i](s_i) \ge v$. The new
constraints (6) enforce this semantics of $\xi$ (and thus the
required max-min semantics of $\mathcal{L}_2(v)$). The number of
constraints in (6) is $O(2^k)$, and thus procedure solve($\mathcal{L}_2(v)$)
also employs the ellipsoid method with a sub-routine for
the associated separation problem. We now show how that
separation problem for $\mathcal{L}_2(v)$ can be solved in pseudo-polynomial
time using a pseudo-polynomial procedure
for the strict Knapsack problem. Given an assignment
$x \in \mathrm{dom}(\mathcal{X})$, its feasibility with respect to (3) and (4)
can be tested directly by substitution. For constraints (6),
let $X \subseteq [k]$ be an optimal solution to the strict Knapsack
$(\{x[\mathbf{b}[i]], \sigma_i\}_{i \in [k]}, x[\xi])$.

•  If the value $\sum_{i \in X} \sigma_i$ of $X$ is smaller than $v$, then $x$ satisfies
all the constraints in (6). Assume to the contrary that
$x$ violates some constraint in (6), corresponding to a set
$X' \subseteq [k]$. By definition of (6), $\sum_{i \in X'} \sigma_i \ge v$, and by our
assumption, $x[\xi] > \sum_{i \in X'} x[\mathbf{b}[i]]$. That, however, implies
that $X'$ is a feasible solution for our strict Knapsack,
and of value higher than that of presumably optimal $X$.

•  Otherwise, if $\sum_{i \in X} \sigma_i \ge v$, then $X$ itself provides us
with a constraint in (6) violated by $x$.
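The two cases above can be sketched with a textbook dynamic program for Knapsack in which the capacity constraint is strict. The sketch makes simplifying assumptions that are ours, not the paper's: weights and capacity are taken to be non-negative integers (in line with the pseudo-polynomial, unary-budget setting), and the names `strict_knapsack` and `separate_value_constraints` are hypothetical.

```python
def strict_knapsack(weights, values, capacity):
    """Max-value subset whose total weight is strictly below `capacity`.
    Assumes non-negative integer weights and an integer capacity.
    Returns (best value, list of chosen indices)."""
    if capacity <= 0:
        return 0, []  # nothing is strictly below a non-positive capacity
    limit = capacity - 1  # strict "<" on integers is "<= capacity - 1"
    best = [(0, [])] * (limit + 1)  # best[w]: optimum using weight at most w
    for i, (w, val) in enumerate(zip(weights, values)):
        for cap in range(limit, w - 1, -1):  # descending: each item used once
            cand = best[cap - w][0] + val
            if cand > best[cap][0]:
                best[cap] = (cand, best[cap - w][1] + [i])
    return best[limit]


def separate_value_constraints(xi, b_values, sigmas, v):
    """Separation for constraints (6): return None if all hold, else the
    index set X of a violated constraint xi <= sum_{i in X} b[i]."""
    value, subset = strict_knapsack(b_values, sigmas, xi)
    return None if value < v else subset
```

The running time of the dynamic program is $O(k \cdot x[\xi])$, which is where the dependence on the unary representation of the budget enters.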

Summary 

We defined and investigated fragments of additive abstractions
for oversubscription planning. Along with revealing
some significant islands of tractability, we exposed an interesting
interplay between these abstractions and certain tools
of combinatorial and convex optimization. Our empirical
tests of the basic abstractions on a prototype system testified
to the promise of the developed approach. Our next steps
will thus be to develop an efficient implementation of the
entire framework, and to engage in further formal investigation
of what is hard and what is tractable in the context of
devising quality abstractions for oversubscription planning.
On Combinatorial Actions
and CMABs with Linear Side Information

Alexander  Shleyfman  and  Antonin  Komenda  and  Carmel  Domshlak1 


Abstract. Online planning algorithms are typically a tool of
choice for dealing with sequential decision problems in combinatorial
search spaces. Many such problems, however, also exhibit combinatorial
actions, yet standard planning algorithms do not cope well
with this type of "the curse of dimensionality." Following a recently
opened line of related work on combinatorial multi-armed bandit
(CMAB) problems, we propose a novel CMAB planning scheme, as
well as two specific instances of this scheme, dedicated to exploiting
what is called linear side information. Using a representative strategy
game as a benchmark, we show that the resulting algorithms very favorably
compete with the state-of-the-art.

1  INTRODUCTION 

In large-scale sequential decision problems, reasoning about the
problem is often narrowed to a state space region that is considered
most relevant to the specific decision problem currently faced by the
agent. In particular, online planning algorithms focus only on the current
state $s_0$ of the agent, deliberate about the set of possible courses
of action from $s_0$ onwards, and, when interrupted, use the outcome
of that exploratory deliberation to select an action to perform at $s_0$.
Once that action is applied in the real environment, the planning process
is repeated from the obtained state to select the next action and
so on.

The basic components of any sequential decision problem are its
states and actions. When the number of actions is polynomial in the
size of the problem description, the basic computational complexity
of planning stems solely from a prohibitive size of the state space,
exponential in the size of the problem representation. This "curse of
state dimensionality" seems to receive most of the attention in the automated
planning research. In particular, state-space forward search
algorithms, including all standard, both systematic and Monte-Carlo,
online planning algorithms, implicitly assume that the action choices
at any state can be efficiently enumerated.

Whatever the atomic actions of the agent are, as long as the agent
can perform only one (or a small fixed number of) atomic actions
simultaneously, the above assumption of enumerable action choices is
typically fine. However, if the agent we are planning for either represents
a team of cooperating agents, or, equivalently, has a number
of concurrent actuators, then the problem exhibits a "curse of action
dimensionality" via the combinatorial structure of the action space.

Real-time strategy games (RTS) are a great example of decision
problems with combinatorial action spaces, because a player is asked
to activate in parallel a set of units that together form the force of the
player [2, 6, 7, 15, 17]. That is, the set of actions $A(s)$ available to a
player at a state $s$ corresponds to a (sometimes proper, due to some
game-specific constraints) subset of the cross-product of explicitly
given sets of atomic actions of her units.

Previous work on online planning in RTS mostly avoided dealing
with combinatorial actions directly, either by sequencing the decisions
made for individual units [7, 15], or by abstracting the combinatorial
action spaces to a manageable set of choices [2, 6]. The
exception seems to be a recent work of Ontañón [16] that suggested
considering combinatorial actions in online planning through the lens
of combinatorial multi-armed bandit (CMAB) problems [11, 5, 16].
In particular, Ontañón suggested a specific Monte-Carlo algorithm
for online planning in CMABs, called Naive Monte-Carlo (NMC),
that is driven by an assumption that the expected value of a combinatorial
action can be faithfully approximated by a linear function of
its components. Evaluated on the μRTS game, NMC was shown to
favorably compete with popular search algorithms such as UCT and
alpha-beta ABCD, which avoid dealing with combinatorial actions
directly [16].

Taking on board the CMAB perspective on combinatorial actions
in sequential decision problems, here we continue the study of online
planning algorithms for CMABs. In particular, we formalize
the basic building blocks of such algorithms, as well as the tradeoffs
in computational resource allocation between them. Based on
this analysis, we suggest a simple, two-phase scheme for CMAB
planning, in which the first phase is dedicated solely to generating
candidate combinatorial actions, and the second phase is dedicated
solely to evaluating these candidates. Adopting the assumption of
helpful linear "side information", we then propose two instances of
this two-phase scheme, $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$, that, in particular, build upon
some recent developments around action selection in regular MAB
problems [3, 4, 12]. Our experimental evaluation on the μRTS game
as in [16] shows that both $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$ substantially outperform
NMC, and emphasizes the marginal value both of exploiting
side information and of systematicity in the candidate evaluation
process.

2  BACKGROUND 

The multi-armed bandit (MAB) problem is a sequential decision
problem defined over a single state. At each stage, the agent has to
execute one out of some $k \ge 2$ stochastic actions $\{a_1, \dots, a_k\}$, with
$a_i$ being parameterized with an unknown distribution $\nu_i$, with expectation
$\mu_i$. If $a_i$ is executed, the agent gets a reward drawn at random
from $\nu_i$.

Most research on MABs has been devoted to the setup of reinforcement
learning-while-acting, where the performance of the
agent is assessed in terms of its cumulative regret, the sum of differences
between the expected reward of the best arm and the obtained
rewards. Good algorithms for learning-while-acting in MAB,
like UCB1 [1], trade off between exploration and exploitation. These
MAB algorithms also gave rise to popular Monte-Carlo tree search
algorithms for online planning in multi-state sequential decision
problems (e.g., MDPs and sequential games), such as ε-MCTS [18],
UCT [14], and MaxUCT [13].

However, as it was first studied in depth by Bubeck et al. [4],
learning-while-acting and online planning are rather different problems
that should favor different techniques. Unlike in learning-while-acting,
the agent in online planning may try the actions "free of
charge" a given number of times $N$ (not necessarily known in advance)
and is then asked to output a recommended arm. The agent
in online planning is evaluated by his simple regret, i.e., the difference
$\mu^* - \mu_i$ between the expected payoff of the best action and the
average payoff obtained by his recommendation $a_i$. In other words,
the rewards obtained by the agents at planning are fictitious. Therefore,
good algorithms for online planning in MABs, like uniform-EBA
[4], Successive Rejects [3], and SequentialHalving [12], are
focused solely on exploration, and they already gave rise to efficient
Monte-Carlo tree search algorithms for online planning in multi-state
sequential decision problems such as BRUE [9] and MaxBRUE [10].
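To make the pure-exploration setting concrete, here is a small sketch in the spirit of Sequential Halving: the sampling budget is split evenly over about $\log_2 k$ rounds, every surviving arm is sampled equally often within a round, and the empirically worse half is dropped. The function name `sequential_halving` and the representation of arms as zero-argument callables are our own assumptions, and the sketch is not taken from the cited work.

```python
import math


def sequential_halving(arms, budget):
    """Recommend an arm index after roughly `budget` samples in total.
    `arms` is a list of zero-argument callables, each returning one reward."""
    survivors = list(range(len(arms)))
    rounds = max(1, math.ceil(math.log2(len(arms))))
    for _ in range(rounds):
        # Equal share of this round's budget for every surviving arm.
        pulls = max(1, budget // (len(survivors) * rounds))
        means = {i: sum(arms[i]() for _ in range(pulls)) / pulls
                 for i in survivors}
        # Keep the better half (rounded up) according to this round's means.
        survivors.sort(key=lambda i: means[i], reverse=True)
        survivors = survivors[:math.ceil(len(survivors) / 2)]
    return survivors[0]
```

Note that the rewards collected along the way are never accumulated into a score: only the quality of the final recommendation matters, which is exactly the simple-regret criterion.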

In contrast to regular MAB problems, in which rewards are associated
with individual actions and a single action is executed at each
stage, in combinatorial multi-armed bandit (CMAB) problems, the
rewards are associated with certain subsets of actions, and the agent
is allowed to simultaneously execute such subsets of actions at each
stage [11, 5, 16]. In terms closest to problems that motivated our
work in the first place, i.e., sequential decision problems for teams of
cooperative agents, a CMAB problem is given by a finite set of $n \ge 1$
classes of actions $\{A_1, \dots, A_n\}$, with $A_i = \{a_{i;1}, \dots, a_{i;k_i}\}$, and a
constraint $C \subseteq \mathcal{A} = [A_1 \cup \{\epsilon\}] \times \dots \times [A_n \cup \{\epsilon\}]$, where $\epsilon$ denotes
"do nothing", and thus $\mathcal{A}$ is the set of all possible subsets of actions,
with at most one representative from each action class. We refer to
every set of actions $\mathbf{a} \in \mathcal{A}$ as a combinatorial action, or c-action, for
short. Each c-action $\mathbf{a}$ is parameterized with an unknown distribution
$\nu(\mathbf{a})$, with expectation $\mu(\mathbf{a})$. At each stage, the agent has to execute
one out of some $2 \le K = |C| \le \prod_{i=1}^{n} (|A_i| + 1)$ c-actions, and if c-action
$\mathbf{a}$ is executed, then the agent gets a reward drawn at random from
$\nu(\mathbf{a})$.

Whether our setup is online planning in CMABs or learning-while-planning
in CMABs, it is easy to see that CMAB problems
with $K = O(\mathrm{poly}(n))$ can be efficiently approached with regular
MAB algorithms. However, if the problem is only loosely constrained
and thus the c-action space grows exponentially with $n$
(as is typically the case in RTS-like planning problems), then
the algorithms for regular MAB problems are a no-go because they
all rely on the assumption that each c-action can be sampled at least
once. This led to devising algorithms for CMAB learning-while-planning
[11, 5] and online planning [16], all making certain assumptions
of "side information", the usefulness of which depends (either
formally or informally) on the properties of $\mu$ over the polytope
induced by $A_1 \times \dots \times A_n$. Such "side information" basically captures
the structure of $\mu$ targeted by the algorithm, but the algorithm
can still be sound for arbitrary expected reward functions. This is,
for instance, the case with the Naive Monte-Carlo algorithm of Ontañón
[16], which we describe in detail and compare to, later on.

3  ONLINE  PLANNING  IN  CMABS 

Due to the “curse of action space dimensionality”, at a high level, any
algorithm for online planning in CMABs should define two strategies:

1. a candidate generation strategy, for reducing the set of candidates
from $C$ to a reasonably small subset $C^* \subseteq C$ of candidates, and

2. a candidate evaluation strategy, for identifying the best c-action
in $C^*$ by gradually improving the corresponding estimates of $\mu$.

Given such a pair of strategies, the overall algorithm can then apply
them, either sequentially or interleaved, to sample the selected
c-actions. The question is, of course, what pair of strategies
to adopt, and how to combine them so as to best exploit
the available planning time. The only previous proposal in that
respect is the recent Naive Monte-Carlo (NMC) algorithm
of Ontañón [16]. At a high level, NMC constitutes a composition
of $\epsilon$-greedy sampling strategies, operated under an assumption
that $\mu$ is linear in the atomic actions of the CMAB, i.e.,

$$\mu(a) = \sum_{i=1}^{n} \sum_{j=1}^{k_i} \mathbb{1}_{\{a_{i;j} \in a\}} \, w_{i;j},$$

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function, and
$w_{i;j} \in \mathbb{R}$. Specifically, at each stage, NMC follows the candidate
generation/evaluation strategy with probability $\epsilon_0$/$(1 - \epsilon_0)$,
respectively, where:

1. The candidate generation strategy generates and samples a candidate
c-action $a$ by selecting atomic actions from each set $A_i$
independently and $\epsilon$-greedily: with probability $\epsilon_1$, $a$ will contain
the “empirically best atomic action” from $A_i$, and with probability
$(1 - \epsilon_1)$, the $i$-th component of $a$ will be selected from $A_i$ uniformly
at random. Atomic action $a_{i;j}$ is “empirically best” in $A_i$
if, so far, the average reward of the (c-action) samples involving
$a_{i;j}$ is the highest among the elements of $A_i$.

2. The candidate evaluation strategy samples the empirically best action
in (the current) set $C^*$.

At the end, the algorithm outputs (and the agent performs) the
empirically best action in $C^*$.
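
The interleaving described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Ontañón's implementation: the function name `naive_monte_carlo`, the default value of `eps1`, the omission of the “do nothing” action and of any constraint, and the tie-breaking rules are all choices made here for brevity.

```python
import random
from collections import defaultdict


def naive_monte_carlo(classes, sample_reward, n_samples, eps0=0.25, eps1=0.5, rng=None):
    """Sketch of NMC: per sample, either generate a candidate or re-sample the best one."""
    rng = rng or random.Random(0)
    atom_sum, atom_cnt = defaultdict(float), defaultdict(int)  # per atomic action
    cand_sum, cand_cnt = {}, {}                                # per generated c-action

    def best_atom(i):
        # Empirically best atomic action of class i among those sampled so far.
        tried = [a for a in classes[i] if atom_cnt[(i, a)] > 0]
        if not tried:
            return rng.choice(classes[i])
        return max(tried, key=lambda a: atom_sum[(i, a)] / atom_cnt[(i, a)])

    def cand_mean(c):
        return cand_sum[c] / cand_cnt[c]

    for _ in range(n_samples):
        if not cand_cnt or rng.random() < eps0:
            # Candidate generation: class by class, best atom w.p. eps1, else uniform.
            c = tuple(best_atom(i) if rng.random() < eps1 else rng.choice(classes[i])
                      for i in range(len(classes)))
        else:
            # Candidate evaluation: sample the empirically best generated c-action.
            c = max(cand_cnt, key=cand_mean)
        r = sample_reward(c)
        cand_sum[c] = cand_sum.get(c, 0.0) + r
        cand_cnt[c] = cand_cnt.get(c, 0) + 1
        for i, a in enumerate(c):
            atom_sum[(i, a)] += r
            atom_cnt[(i, a)] += 1
    return max(cand_cnt, key=cand_mean)
```

Note that the recommendation in the last line may rest on a single sample of the returned c-action, since nothing in the loop enforces a minimum number of samples per candidate.
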

Assuming $\epsilon_0, \epsilon_1 < 1$, every c-action in $C$ will eventually be generated
and then sampled infinitely often. Thus, NMC converges in the
limit to the best c-action in $C$, independently of whether the
assumption of $\mu$'s linearity in atomic actions actually holds. Moreover,
in an empirical evaluation on the μRTS game in [16], NMC was
shown to substantially outperform standard tree search algorithms,
such as UCT and ABCD, showing the promise of CMAB
planning algorithms in decision problems with combinatorial actions.
This was precisely the point of departure for our work here.

Considering the dynamics of NMC, we note that candidate generation
and candidate evaluation in it are stochastically interleaved, and
the interleaving is made at the resolution of single samples. One possible
motivation for such an interleaving might be in exploiting samples
made at the evaluation steps to improve the estimated side information
(i.e., the linear function coefficients) for the generation steps.
However, a closer look suggests that such an interleaving of candidate
generation and candidate evaluation disadvantages the planning
process in two ways.

• If $m$ samples are devoted directly to candidate generation,
then the algorithm will generate up to $m$ new candidates. Thus,
even if the side information assumptions hold, a vast majority of
the candidates will unavoidably be generated without quality
guidance from this side information, as the latter is acquired
gradually over the candidate generation process.

• More importantly, while NMC converges to the best c-action in
the limit, for no reasonable budget of samples $N = o(K)$ can it
provide any meaningful guarantees on the quality of the recommended
action, not only with respect to the entire set of choices $C$
(which is understandable), but even with respect to the generated
subset of candidates $C^*$.

The latter issue appears to be especially concerning, and, in particular,
it stems from the fact that, after any number $N = o(K)$ of
samples, the best empirical mean among the c-actions in $C^*$ might be
based on just a single sample of the respective c-action.

Taking that on board, in what follows we examine the prospects of
algorithms that exhibit no interleaving of candidate generation and
candidate evaluation at all. These algorithms take a simple two-phase
approach of dividing the overall sample allowance $N$ between the
candidate generation phase that runs first, using $N_g$ samples, and the
candidate evaluation phase that runs second, using $N_e = N - N_g$
samples. The motivation behind this simple two-phase scheme is
twofold. Fixing some $k$ c-action candidates $C^*$ induces a problem of
online planning in a regular MAB, and state-of-the-art algorithms for
this problem guarantee that the probability of choosing a sub-optimal
c-action from $C^*$ decreases exponentially with $N_e$ [4, 3, 12]. Conversely,
suppose that, given a sample allowance $N_e$ for the candidate
evaluation phase, the algorithm of our choice for this phase guarantees
that the recommended c-action will indeed be the best among $C^*$
with probability of at least $\delta(k)$. If we are interested in a choice-error
probability of at most $\delta$, then there is no point in coming up with
more than some $k(\delta, N_e)$ candidate c-actions, and thus the candidate
generation phase can/should be optimized to select precisely
that number of candidates.

In what follows, we suggest and evaluate two simple variants of
two-phase online planning for CMAB, $\mathrm{LSI}_V$ (short for “linear side
information from vertices”) and $\mathrm{LSI}_F$ (short for “linear side information
from facets”). Both algorithms assume the same type of helpful
side information, namely that $\mu$ is faithfully approximated by a
function that is linear in the atomic actions of the CMAB, and differ
only in the way this side information is actually estimated. Some
auxiliary notation: $[n]$ for $n \in \mathbb{N}$ denotes the set $\{1, \dots, n\}$. For a
finite-domain, non-negative, real-valued function $f$, $\mathcal{D}[f]$ denotes a
probability distribution over the domain of $f$, obtained by normalizing
$f$ as a probability function using a normalization of our choice.
For such a probability distribution $\mathcal{D}[f]$ and a non-empty subset $S$ of
$f$'s domain, by $\mathcal{D}[f|_S]$ we refer to the conditional of $\mathcal{D}[f]$ on $S$.
Finally, the operation of drawing a sample from a distribution $\mathcal{D}$ is
denoted by $\sim \mathcal{D}$.

Figure 1a depicts the two-phase sampling scheme underlying both
$\mathrm{LSI}_V$ and $\mathrm{LSI}_F$. Given a partition of the sample budget $N$ into $N_g$ and
$N_e$, the algorithms first generate $k(N_e)$ c-actions (GENERATE), and
then evaluate these c-actions to recommend one of them (EVALUATE).
The GENERATE procedure comprises

(1) generating a weight function $R$ from atomic actions (adopting
the linear side information assumption);

(2) schematically generating a probability distribution $\mathcal{D}_R$ over
c-action space $C$, biased “towards” $R$; and

(3) sampling (up to) $k(N_e)$ c-actions $C^*$ from $\mathcal{D}_R$.

EVALUATE then implements the recent SequentialHalving algorithm
of Karnin et al. [12] for action recommendation (i.e., online planning)
in regular MABs. Any other algorithm for this problem will do as
well, but SequentialHalving provides the best formal guarantees to
date, and it is the algorithm we have used in our empirical evaluation
discussed later on.
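
As a rough illustration of the EVALUATE step, the sketch below implements the SequentialHalving loop over a fixed candidate set. It is a simplified reading of the procedure, assuming hashable candidates and a caller-supplied `sample_reward` function (both names are placeholders introduced here), and it keeps running averages across rounds.

```python
import math


def sequential_halving(candidates, sample_reward, budget):
    """Recommend one candidate using at most `budget` reward samples."""
    survivors = list(candidates)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    totals = {c: 0.0 for c in survivors}
    counts = {c: 0 for c in survivors}

    def mean(c):
        return totals[c] / counts[c] if counts[c] else float("-inf")

    for _round in range(rounds):
        # Split this round's share of the budget equally among the survivors.
        m = budget // (len(survivors) * rounds)
        for c in survivors:
            for _ in range(m):
                totals[c] += sample_reward(c)
                counts[c] += 1
        # Keep the better half (rounded up) by empirical mean.
        survivors.sort(key=mean, reverse=True)
        survivors = survivors[:math.ceil(len(survivors) / 2)]
    return survivors[0]


# Deterministic toy check: the reward of candidate c is c itself.
print(sequential_halving(range(8), lambda c: float(c), 100))  # 7
```

Unlike the interleaved scheme, every surviving candidate receives the same number of samples in each round, so the recommendation never rests on a single lucky sample as long as the budget covers the first round.
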

Steps (1) and (2) of GENERATE are formulated above at a high level,
and there are a number of ways one can implement these steps. Considering
step (1), if $\mu$ is indeed linear in atomic actions and $C$ comprises
all the possible combinations of atomic actions, then one can
simply (i) pick an arbitrary set of $|A|$ c-actions that span the atomic
actions $A$, (ii) use the average rewards obtained from sampling these
actions equally often to construct a linear $|A| \times |A|$ system, (iii)
solve this system to obtain the coefficients $w_{i;j}$ of $\mu$, and (iv) skip
the EVALUATE step, recommending the c-action that maximizes $\mu$.
However, $\mu$ can be highly non-linear, and the constraint $C$
can be arbitrarily complex. Thus, the side information should be estimated
and used in a way that relies on, yet is not constrained by, the
side information assumption.

Given that, in SIDEINFO, both algorithms partition the sample allowance
$N_g$ equally between the atomic actions, and, for each atomic
action $a_{i;j}$, set its weight $R(a_{i;j})$ to the average reward obtained
from sampling some c-actions containing $a_{i;j}$. This is precisely the
point where $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$ slightly differ, and Figure 1b depicts the
two corresponding versions of the EXTEND subroutine.

• In $\mathrm{LSI}_V$, all the $m$ samples in $a_{i;j}$'s budget are dedicated to a single
c-action, notably the c-action comprising only $a_{i;j}$. (In RTS, this
corresponds to a c-action that activates unit $i$ while leaving all
other units idle.)

• In $\mathrm{LSI}_F$, the $m$ samples in $a_{i;j}$'s budget go to some $\leq m$ c-actions
containing $a_{i;j}$ that are generated uniformly at random.

In other words, $\mathrm{LSI}_V$ establishes weights $R$ by sampling all the
$\sum_{i=1}^{n} k_i$ neighbors of a single vertex of the polytope induced by $A$,
and $\mathrm{LSI}_F$ establishes weights $R$ by sampling the facets induced by
the atomic actions $A$ on that polytope. A priori, the relative attractiveness
of $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$ can be assessed only heuristically: the
closer $\mu$ is to satisfying the assumption of linearity, the more advantageous
$\mathrm{LSI}_V$ seems to be, and conversely, the farther $\mu$
is from linearity, the more reliable, relatively, is the side information
provided by $\mathrm{LSI}_F$.
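
The two ways of estimating the weights can be sketched as follows. This is a minimal sketch under assumptions: c-actions are tuples with one slot per class, `None` plays the role of “do nothing”, and the function names and the reward callback are placeholders introduced here.

```python
import random


def extend_vertex(i, atom, classes, rng):
    """LSI_V: the c-action containing only `atom`; every other class stays idle."""
    return tuple(atom if l == i else None for l in range(len(classes)))


def extend_facet(i, atom, classes, rng):
    """LSI_F: `atom` plus a uniformly random atomic action for every other class."""
    return tuple(atom if l == i else rng.choice(classes[l]) for l in range(len(classes)))


def side_info(classes, sample_reward, n_g, extend, rng=None):
    """Split n_g samples equally over atomic actions; average the observed rewards."""
    rng = rng or random.Random(0)
    atoms = [(i, a) for i, a_i in enumerate(classes) for a in a_i]
    m = n_g // len(atoms)
    weights = {}
    for i, a in atoms:
        rewards = [sample_reward(extend(i, a, classes, rng)) for _ in range(m)]
        weights[(i, a)] = sum(rewards) / m if m else 0.0
    return weights


# Toy linear reward: each atomic action contributes a fixed (hypothetical) weight.
w = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
toy_classes = [["a", "b"], ["c", "d"]]
toy_reward = lambda c: sum(w[x] for x in c if x is not None)
R_v = side_info(toy_classes, toy_reward, 8, extend_vertex)
R_f = side_info(toy_classes, toy_reward, 8, extend_facet)
```

With a perfectly linear reward, the vertex variant recovers the per-action weights exactly, whereas the facet variant mixes in the average contribution of the other classes.
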

Proceeding now with step (2) of GENERATE, i.e., using the weight
function $R$ to fix a probability distribution $\mathcal{D}_R$ over $C$, in Figure 1c
we show two specific realizations of this step, GENERATE-ENTROPY
and GENERATE-UNION. Both these realizations are motivated by
the fact that $C$ can comprise an arbitrary subset of the entire
cross-product of the atomic action classes, and in both, $\mathcal{D}_R$ is specified
implicitly, via auxiliary distributions over subsets of atomic actions,
and step (2) is effectively combined with step (3) of sampling $k(N_e)$
c-actions $C^*$ from $\mathcal{D}_R$.

• In GENERATE-ENTROPY, the atomic action classes are ordered
in the increasing order of entropy that is exhibited by the corresponding
probability distributions $\mathcal{D}[\{R(a_{i;1}), \dots, R(a_{i;k_i})\}]$,
as measured by an entropy measure $H$ (such as the Shannon entropy,
or some other Rényi entropy [8]). These measures quantify
the diversity of probability distributions, and are maximized by
uniform distributions, under which all atomic actions of a class
look equally valuable.
Hence, if c-actions are generated by sampling the atomic action
classes sequentially, yet these sequential choices are
inter-constrained, sampling the action classes in the increasing order of
$H(\mathcal{D}[\{R(a_{i;1}), \dots, R(a_{i;k_i})\}])$ prioritizes classes in which the
different atomic actions actually differ in their purported value,
and thus the choice really matters.

• In GENERATE-UNION, the action classes are not sampled independently,
but each c-action added to $C^*$ is generated by sampling
the union of all the atomic actions according to $\mathcal{D}[R|_A]$, iteratively
updating the conditioning set $A$ to contain only actions from
classes that are yet to be represented in the constructed c-action
candidate.

procedure 2PHASE-CMAB(N_g, N_e)
    C* ← GENERATE(N_g, k(N_e))
    a* ← EVALUATE(C*, N_e)
    return a*

procedure GENERATE(N_g, k)
    R ← SIDEINFO(N_g)
    set D_R : C → [0, 1]
    C* ← ∅
    for k times do
        a ∼ D_R
        C* ← C* ∪ {a}
    return C*

procedure EVALUATE(C*, N_e)
    // SequentialHalving [12]
    C_0 ← C*
    for i = 0 to ⌈log2 |C*|⌉ − 1 do
        m ← ⌊N_e / (|C_i| ⌈log2 |C*|⌉)⌋
        for each a in C_i do
            for m times do
                r ∼ ν(a)
                averaging update μ(a) with r
        C_{i+1} ← ⌈|C_i|/2⌉ μ-best elements of C_i
    return (the only) action in C_{⌈log2 |C*|⌉}

(a)

procedure SIDEINFO(N_g)
    A ← ∪_{i=1}^{n} A_i,  m ← ⌊N_g / |A|⌋
    for a_{i;j} in A do
        for m times do
            r ∼ ν(EXTEND(a_{i;j}))
            averaging update R(a_{i;j}) with r
    return R

procedure EXTEND(a_{i;j})  // LSI_V
    a ← {a_{i;j}}
    return a

procedure EXTEND(a_{i;j})  // LSI_F
    a ← {a_{i;j}}
    for l ∈ [n] \ {i} do
        a_{l;j'} ∼ U(A_l)
        a ← a ∪ {a_{l;j'}}
    return a

(b)

procedure GENERATE-ENTROPY(N_g, k)
    R ← SIDEINFO(N_g)
    C* ← ∅
    for i ∈ [n] do
        H_i ← H(D[R|_{A_i}])
    for k times do
        a ← ∅
        for i ∈ [n], in increasing order of H_i do
            a_{i;j} ∼ D[R|_{A_i}]
            a ← a ∪ {a_{i;j}}
        C* ← C* ∪ {a}
    return C*

procedure GENERATE-UNION(N_g, k)
    R ← SIDEINFO(N_g)
    C* ← ∅
    for k times do
        a ← ∅
        A ← ∪_{i=1}^{n} A_i
        while A ≠ ∅ do
            a_{i;j} ∼ D[R|_A]
            A ← A \ A_i
            a ← a ∪ {a_{i;j}}
        C* ← C* ∪ {a}
    return C*

(c)

Figure 1. (a) The general 2PHASE-CMAB planning scheme, as well as the LSI scheme for candidate generation and evaluation, (b) specifics of the LSI_V and
LSI_F instances of 2PHASE-CMAB, and (c) two specific procedures for LSI candidate generation.
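
The two candidate generation procedures of Figure 1c might be rendered in Python roughly as below. This is a sketch under several assumptions made here: weights are non-negative and normalized proportionally (with a uniform fallback when all weights are zero), the constraint $C$ is ignored, the weight table `R` is keyed by `(class index, atomic action)` pairs, and all function names are placeholders.

```python
import math
import random


def normalize(weights):
    """D[f]: proportional normalization of non-negative weights (uniform if all zero)."""
    total = sum(weights)
    if total <= 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


def generate_entropy(classes, R, k, rng=None):
    """Sample classes in increasing order of the entropy of their weight distribution."""
    rng = rng or random.Random(0)
    dists = [normalize([R[(i, a)] for a in a_i]) for i, a_i in enumerate(classes)]
    order = sorted(range(len(classes)), key=lambda i: shannon_entropy(dists[i]))
    candidates = set()
    for _ in range(k):
        chosen = {}
        for i in order:  # a constraint check would go here, class by class
            chosen[i] = rng.choices(classes[i], weights=dists[i])[0]
        candidates.add(tuple(chosen[i] for i in range(len(classes))))
    return candidates


def generate_union(classes, R, k, rng=None):
    """Sample from the union of atomic actions, dropping each class once it is represented."""
    rng = rng or random.Random(0)
    candidates = set()
    for _ in range(k):
        chosen = {}
        remaining = [(i, a) for i, a_i in enumerate(classes) for a in a_i]
        while remaining:
            probs = normalize([R[x] for x in remaining])
            i, a = rng.choices(remaining, weights=probs)[0]
            chosen[i] = a
            remaining = [x for x in remaining if x[0] != i]
        candidates.add(tuple(chosen[i] for i in range(len(classes))))
    return candidates
```

Because duplicates collapse in the returned set, both functions yield up to `k` distinct candidates. Without a constraint the class ordering in `generate_entropy` does not change the sampling distribution; it only matters once earlier choices restrict later ones.
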


4  TWO-PHASE  CMAB  MEETS  RTS 

In what follows, we report on an empirical evaluation of 2PHASE-CMAB
on top of the μRTS game platform of Ontañón [16].¹ Our
objective in this evaluation was

• to examine the relative effectiveness of 2PHASE-CMAB in general,
and of $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$ in particular,

• to examine the marginal contribution of the two phases of
2PHASE-CMAB, and

• to examine the relevance of CMAB planning to multi-state sequential
planning problems with combinatorial actions, such as
RTS games.

The μRTS platform already contained an implementation of Naive
Monte-Carlo (NMC), with parameters optimized as in [16], and we
have added an implementation of $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$; henceforth, superscripts
$e$ and $u$ denote the versions of these algorithms using
GENERATE-ENTROPY and GENERATE-UNION, respectively.


4.1 μRTS

μRTS is a two-player zero-sum game that exhibits standard features
of popular RTS games, and in particular, heterogeneous units with
durative actions that can be activated concurrently. In our experiments
we use the 8x8 grid environment. The environment is fully
observable, and each grid cell can be occupied either by a single unit,
or by a building, or by a resource storage. The storages each have a
limited supply of the resource, and they can be used by both players.
Table 1 shows the parameters of units and buildings. A player
can build working units and combat units. The working units are all
identical (Worker), and they can move the resources around, build
buildings, and attack other units. As attackers, however, they are
weak. The combat units come in three types: light melee (LMelee),
heavy melee (HMelee), and ranged (Ranged), all better attackers
than working units, each with its own strengths and weaknesses.
In general, movements and attacks are possible only within the
4-neighbourhood of the unit; all actions are durative and not interruptible.
Finally, working units and combat units can be built only in
Base and Barracks buildings, respectively.

            HP    Cost   T(Move)   Damage   Range   T(Prod)
Base        10    10     -         -        -       250
Barracks    4     5      -         -        -       200
Worker      1     1      10        1        1       50
LMelee      4     2      8         2        1       80
HMelee      4     2      12        4        1       120
Ranged      1     2      12        2        3       100

Table 1. Parameters of different buildings and units in μRTS. HP stands
for health points, Cost is in resource units, and T(Move) is the duration of a
single move (in simulated time units). Damage and Range represent the decrease
of HP of the target unit and the range of the attack, respectively, and T(Prod)
is the duration of producing the unit/building.

¹ We would like to thank Santiago Ontañón for making the μRTS platform
available to the public.

For each player, the initial state of the game contains one Base, one
Worker near the base, and one nearby resource storage, which, even
if the resources are gathered optimally, suffices only for 1/3 of the
maximal game duration. The game is restricted to 3000 simulated
time units, and typically, it evolves in three phases in terms of the
number of controlled units (and thus, the number of alternative
c-actions): (i) early game (fewer units), (ii) mid game (more units), and
(iii) end game (fewer units). Figures 2a-2c depict representative states from
these three phases.

Figure 2. Three typical phases of the game in terms of unit counts: (a) early, (b) mid, and (c) end phases. Picture (d) depicts a “face-up” complex decision
scenario: top to bottom, the rows of units are Ranged, HMelee, and LMelee of one player, and then, in reverse order, the units of the other player.

4.2  Experiments 

In our experiments, we compared the performance of $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$
to that of NMC. In μRTS settings, the latter was already shown to
substantially outperform state-of-the-art tree search algorithms,
such as ABCD, $\epsilon$-greedy, and UCT, as well as regular MAB algorithms
and some handcrafted heuristics [16]. As a baseline, we also
added two basic algorithms that were already implemented in μRTS,
namely

Random, selecting a random action for each agent as soon as it can
act, and

LRush, a handcrafted heuristic policy that first gathers resources
optimally with one of the workers and builds a
Barracks as soon as it becomes possible, and after that builds
only LMelee units, which go towards and attack the closest enemy
units and buildings.

Considering 2PHASE-CMAB, we have implemented $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$
with both GENERATE-ENTROPY and GENERATE-UNION, resulting
in four 2PHASE-CMAB algorithms: $\mathrm{LSI}_V^e$, $\mathrm{LSI}_V^u$, $\mathrm{LSI}_F^e$, and $\mathrm{LSI}_F^u$. In
line with the $\epsilon_0$ parameter of NMC being preset to 0.25, we have set
the $N_g$ and $N_e$ parameters of 2PHASE-CMAB to $0.25N$ and $0.75N$,
respectively. The number of candidates $k(N_e)$ was set so that the first
iteration of SequentialHalving will sample each candidate at least
once. The $H$-measure in GENERATE-ENTROPY was set to the Shannon
entropy.
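
The budget split and the candidate-count criterion above can be worked through numerically. The sketch below assumes that the largest admissible number of candidates is chosen; the text only states the criterion (each candidate sampled at least once in the first round), so taking the maximum is an assumption, as is the helper name `max_candidates`.

```python
import math


def max_candidates(n_e):
    """Largest k such that floor(n_e / (k * ceil(log2 k))) >= 1."""
    k = 1
    while (k + 1) * max(1, math.ceil(math.log2(k + 1))) <= n_e:
        k += 1
    return k


N = 2000                         # samples per decision, as in the experiments
N_g, N_e = N // 4, 3 * N // 4    # 0.25 N for generation, 0.75 N for evaluation
print(N_g, N_e, max_candidates(N_e))  # 500 1500 187
```

With $N_e = 1500$, any $k$ between 129 and 256 needs 8 halving rounds, so the first round can afford one sample per candidate only while $8k \leq 1500$, which gives $k \leq 187$.
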

Likewise, to assess the marginal value of exploiting side information,
we have also implemented a simplified instance, noSI, that
selects $k(N_e)$ c-action candidates uniformly at random, and then
passes these candidates to EVALUATE, which implements SequentialHalving
as in $\mathrm{LSI}_V$ and $\mathrm{LSI}_F$. In other words, noSI is a purified version
of 2PHASE-CMAB that relies on no side information whatsoever.
Importantly, to allow a meaningful comparison, the number of candidates
$k(N_e)$ in noSI was set exactly as in the $\mathrm{LSI}_V$/$\mathrm{LSI}_F$ variants, yet
the trivial GENERATE phase of noSI then uses only $k(N_e) \ll N_g$
samples, discarding the residual $N_g - k(N_e)$ samples.

To reduce as much as possible the effect of the variance, as well
as of possible biases of the game simulator, each algorithm played
600 games against each other algorithm, 300 games as “player 1” and
300 games as “player 2”. For all the CMAB algorithms (including
NMC),

• the overall computational effort was set to $N = 2000$ samples per
decision;

• each sample comprised a simulated rollout of 200 game time units,
with actions along the rollouts being selected at random, with a
bias towards attacking a unit in reach, if there is one; and

• for rollouts ending at non-terminal states, the reward was assessed
by a built-in evaluation function, reflecting the number of units
and their health at the rollout’s end-state.

This lookahead and evaluation procedure is similar to the one used
in the original evaluation of NMC [16].

Table 2 shows the results of this head-to-head competition between
the algorithms.

• Consistent with the previous experiments of Ontañón [16], all
the CMAB algorithms easily defeated both Random and LRush,
with the LSI instances never losing to these two baselines. The
latter is not so for NMC, but its advantage over Random and
LRush is also clear-cut.

• All four LSI instances of 2PHASE-CMAB consistently outperformed
noSI. At the same time, the performance of the LSI instances
among themselves was roughly on par. In sum, this performance
of LSI vs. noSI testifies both to the usefulness of linear
side information (at least in this specific benchmark), and
to the ability of the four GENERATE procedures of LSI instances
to home in on this side information.

• All five 2PHASE-CMAB instances, including noSI, substantially
outperformed NMC. These results, and especially the result
for noSI vs. NMC, strongly testify to the importance of a systematic
candidate evaluation, and a controlled choice of the number
of candidates to evaluate.

The latter point is even more strongly supported by the results of an
additional experiment that we performed on a complex decision scenario
with 12 combat units on each side (see Figure 2d), and 5184
applicable c-actions per player. Figures 3a and 3b show that the empirical
mean and variance of the value that NMC and $\mathrm{LSI}_V$ estimated
for the c-actions they ended up recommending were rather similar,
consistently over different sizes of sample budget. At the
same time, the number and the magnitude of the outliers, especially
of those overestimating the evaluation, were substantially larger with
NMC. These overestimates are critical in strategic scenarios such as the
“face-up” scenario used in the experiments, because making a particularly
bad decision here can mean losing the entire game.
An example of such a decision in μRTS is whether or not to build Barracks
at an early stage of the game. As illustrated in Figure 3c,
these drastically overestimating outliers are caused by the extreme
variance in the number of samples used by NMC to estimate the
value of the c-action that ends up being recommended, with quite
often the recommendation being based on just a single estimating
sample of the respective c-action. In contrast, the variance in the
number of samples used to estimate the value of the recommended
c-action in the systematic candidate evaluation procedure of $\mathrm{LSI}_V$ is
almost negligible, making the c-action selection process much more
robust.

w/t/l      Random     LRush     NMC        noSI       LSI_V^e    LSI_V^u    LSI_F^e    LSI_F^u
Random     38/27/35   94/0/6    100/0/0    100/0/0    100/0/0    100/0/0    100/0/0    100/0/0
LRush                 0/100/0   96/0/4     98/0/2     100/0/0    100/0/0    100/0/0    100/0/0
NMC                             41/13/46   52/12/36   54/14/32   55/14/31   54/15/31   52/15/32
noSI                                       42/17/41   46/17/40   47/14/38   44/17/39   46/17/37
LSI_V^e                                               40/18/42   42/19/40   42/16/42   43/18/39
LSI_V^u                                                          41/17/42   39/16/44   45/15/40
LSI_F^e                                                                     43/17/40   41/17/42
LSI_F^u                                                                                41/16/43

Table 2. The results of the head-to-head competition: the percentage of wins/ties/losses of the column algorithm against the row algorithm.

Figure 3. Empirical mean and variance of the estimated value of the c-action recommended by NMC (a) and LSI_V (b) in a complex “face-up” decision
scenario (Figure 2d), after different sampling budgets (x-axis), ten runs per sample budget. The dots show the outliers that deviate three standard deviations from
the mean. Respectively, (c) depicts the variance in the number of samples dedicated to the recommended c-action by NMC and LSI_V.

Abstract

Similarly to classical planning, in MA-Strips multiagent
planning, heuristics significantly improve the efficiency
of search-based planners. Heuristics based on
solving a relaxation of the original planning problem
are intensively studied and well understood. In particular,
the delete relaxation, where all delete effects of
actions are omitted, is frequently used. In this paper, we
present a unified view on the distribution of delete relaxation
heuristics for multiagent planning.

Until recently, the most common approach to adapting
heuristics for multiagent planning was to compute
the heuristic estimate using only a projection of
the problem for a single agent. In this paper, we place
this approach in context with techniques that allow
sharing more information among the agents and thus
improve the heuristic estimates. We thoroughly evaluate
experimentally the properties of our distribution of the
additive, max and Fast-Forward relaxation heuristics in
a planner based on distributed Best-First Search. The
best performing distributed relaxation heuristics compare
favorably to a state-of-the-art MA-Strips planner
in terms of benchmark problem coverage. Finally, we
analyze the impact of limited agent interactions by means
of the recursion depth of the heuristic estimates.

Introduction 

Planning in a shared deterministic environment for and
by a team of cooperative agents with common goals
is a natural extension of classical planning. To model
such planning problems, (Brafman and Domshlak 2008)
proposed a multiagent extension of the classical Strips
formalization, MA-Strips. The model presumes a set
of cooperative agents defined by their capabilities in
the form of a set of actions partitioned from the original
planning problem. In general, not all the agents need
to (or even can) consider the complete planning
problem; therefore only subsets of facts and actions are
marked as public and known to the whole team.

In  recent  years,  several  multiagent  planning  tech¬ 
niques  solving  MA-Strips  problems  were  proposed. 
One  focusing  on  optimality,  scalability  and  efficiency 
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tificial  Intelligence  (www.aaai.org).  All  rights  reserved. 


was  proposed  by  (Nissirn  and  Brafman  2012).  It 
adopted  one  of  the  currently  most  successful  approaches 
in  classical  planning-  Best-First  Search  with  highly  in¬ 
formed  automatically  derived  heuristics.  The  heuristics 
used  were  LM-cut  (Helmert  and  Domshlak  2009)  with 
pathmax  equation  and  merge-and-shrink  (Helmert, 
Haslum,  and  Hoffmann  2007),  both  admissible  as  the 
approach  aimed  at  optimal  planning.  The  heuristics 
used  only  local  information  of  the  respective  agents,  ef¬ 
fectively  working  only  with  their  own  facts  and  actions 
and  public  actions  of  other  agents.  (Crosby,  Rovatsos, 
and  Petrick  2013)  proposed  a  centralized  planning  ap¬ 
proach  based  on  multiagent  decomposition  and  local 
heuristic  estimates.  The  approach  used  delete  relax¬ 
ation,  particularly  the  Fast-Forward  (FF)  (Hoffmann 
and  Nebel  2001)  heuristics  and  focused  on  satisfiability. 
Another  approach  proposed  by  (Borrajo  2013)  can  in 
principle  end  up  as  planning  with  a  global  heuristic  esti¬ 
mations,  however  requires  private  information  of  other 
agents1.  On  the  contrary,  the  approach  by  (Torreno, 
Onaindia,  and  Sapena  2013)  preserves  private  knowl¬ 
edge  and  proposes  distributed  heuristic  estimate,  how¬ 
ever  it  is  not  based  on  relaxation  of  the  original  prob¬ 
lem,  but  on  Domain  Transition  Graphs  (Helmert  2006). 
In  discussion  of  (Nissirn  and  Brafman  2012),  the  authors 
state  that  “the  greatest  practical  challenge  [...]  is  that 
of  computing  a  global  heuristic  by  a  distributed  system’’. 
A  recent  work  by  (Stolba  and  Komenda  2013)  targeted 
this  challenge  for  distributed  Fast-Forward  heuristic 
with  focus  on  obtaining  the  same  estimates  as  by  a 
centralized  solution,  rather  than  searching  for  a  general 
solution  for  wide  variety  of  relaxation  heuristics. 

In this work, we focus on the distribution of the general principle of
delete relaxation heuristics in MA-Strips planning with state-of-the-art
implementation approaches. We evaluate the properties of such distributed
heuristics from both the computational and the communication perspective and
analyze the quality of the heuristics based on the estimation depth in the
sense of participating agents, i.e., agent coupling relaxation.


¹ The proposed solution used obfuscation of the private information, which
can be prohibited in cases where even the “structure” of the information
must not be revealed.


Multiagent  Planning 

A MA-Strips (Brafman and Domshlak 2008) planning problem is a quadruple
$\Pi = \langle \mathcal{L}, \mathcal{A}, s_0, S_g \rangle$, where
$\mathcal{L}$ is a set of propositions, $\mathcal{A}$ is a set of
cooperative agents $\alpha_1, \ldots, \alpha_{|\mathcal{A}|}$, $s_0$ is an
initial state and $S_g$ is a set of goal states. A state
$s \subseteq \mathcal{L}$ is a set of atoms from a finite set of
propositions $\mathcal{L} = \{p_1, \ldots, p_m\}$ which hold in $s$. An
action is a tuple $a = \langle pre(a), add(a), del(a) \rangle$, where $a$ is
a unique action label and $pre(a)$, $add(a)$, $del(a)$ respectively denote
the sets of preconditions, add effects and delete effects of $a$ from
$\mathcal{L}$.

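An agent $\alpha = \{a_1, \ldots, a_n\}$ is characterized precisely by its
capabilities, a finite repertoire of actions it can perform in the
environment. MA-Strips problems distinguish between the public and internal
(or private) facts and actions. Let
$atoms(a) = pre(a) \cup add(a) \cup del(a)$ and similarly
$atoms(\alpha) = \bigcup_{a \in \alpha} atoms(a)$. An $\alpha$-internal
subset of all facts $\mathcal{L}$ of agent $\alpha$ will be denoted as
$\mathcal{L}^{\alpha\text{-}int}$, where
$\mathcal{L}^{\alpha\text{-}int} = atoms(\alpha) \setminus \bigcup_{\beta \in \mathcal{A} \setminus \alpha} atoms(\beta)$,
and the subset of all public facts as
$\mathcal{L}^{pub} = \mathcal{L} \setminus \bigcup_{\alpha \in \mathcal{A}} \mathcal{L}^{\alpha\text{-}int}$.
All facts relevant for one particular agent $\alpha$ are denoted as
$\mathcal{L}^{\alpha} = \mathcal{L}^{\alpha\text{-}int} \cup \mathcal{L}^{pub}$,
and a projection $s^\alpha$ of a state to an agent $\alpha$ is the subset of
a global state $s$ containing only public facts and $\alpha$-internal facts,
formally $s^\alpha = s \cap \mathcal{L}^{\alpha}$. The set of public actions
of agent $\alpha$ is defined as
$\alpha^{pub} = \{a \mid a \in \alpha, atoms(a) \cap \mathcal{L}^{pub} \neq \emptyset\}$
and the set of internal actions as
$\alpha^{int} = \alpha \setminus \alpha^{pub}$. The symbol $a^\alpha$ will
denote a projection of an action $a \in \beta$, $\beta \neq \alpha$, for
agent $\alpha$, i.e., the action stripped of all propositions of other
agents, formally $atoms(a^\alpha) = atoms(a) \cap \mathcal{L}^{\alpha}$.
Finally, $\alpha^{proj}$ will denote the set of all public actions of the
other agents $\mathcal{A} \setminus \alpha$ projected for agent $\alpha$.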

Note that all actions of an agent $\alpha$ use only the agent's facts,
formally $\forall a \in \alpha : atoms(a) \subseteq \mathcal{L}^{\alpha}$,
by definition in (Brafman and Domshlak 2008). The goal set $S_g$ of a
multiagent planning problem will be treated as public (Nissim and Brafman
2012), therefore all goal-achieving actions are public.
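
The definitions above can be illustrated by a minimal Python sketch. It is
not the authors' implementation; the `Action` class, the helper names and the
two-agent truck/plane example are assumptions made only for illustration, and
the sketch assumes that every proposition occurs in at least one action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset
    add: frozenset
    dele: frozenset  # "del" is a Python keyword, hence the different field name

    def atoms(self):
        return self.pre | self.add | self.dele


def agent_atoms(actions):
    """atoms(alpha): union of atoms(a) over all actions a of one agent."""
    result = set()
    for action in actions:
        result |= action.atoms()
    return result


def partition_facts(agents):
    """Split facts into per-agent internal sets and the public set.

    `agents` maps an agent name to its list of actions.
    """
    atoms = {name: agent_atoms(actions) for name, actions in agents.items()}
    internal = {}
    for name in atoms:
        others = set()
        for other, other_atoms in atoms.items():
            if other != name:
                others |= other_atoms
        internal[name] = atoms[name] - others
    every_fact = set().union(*atoms.values())
    public = every_fact - set().union(*internal.values())
    return internal, public


def public_actions(actions, public):
    """alpha^pub: actions that touch at least one public fact."""
    return [a for a in actions if a.atoms() & public]


def project(action, relevant):
    """a^alpha: the action stripped of all atoms outside L^alpha."""
    return Action(action.name, action.pre & relevant,
                  action.add & relevant, action.dele & relevant)


def fs(*items):
    return frozenset(items)


# Hypothetical two-agent example: a truck hands a package over to a plane.
truck = [
    Action("load", fs("pkg_at_A", "truck_at_A"), fs("pkg_in_truck"), fs("pkg_at_A")),
    Action("drive", fs("truck_at_A"), fs("truck_at_B"), fs("truck_at_A")),
    Action("unload", fs("pkg_in_truck", "truck_at_B"), fs("pkg_at_B"), fs("pkg_in_truck")),
]
plane = [
    Action("board", fs("pkg_at_B", "plane_at_B"), fs("pkg_in_plane"), fs("pkg_at_B")),
]

internal, public = partition_facts({"truck": truck, "plane": plane})
plane_facts = internal["plane"] | public            # L^plane
unload_for_plane = project(truck[2], plane_facts)   # projection of a public action
```

In this example the only public fact is `pkg_at_B`, so `unload` is the only
public action of the truck, and its projection for the plane has no
preconditions left, which is exactly the information loss that the
distributed heuristics discussed later try to compensate for.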

Multiagent  Best-First  Search 

The planning algorithm we assume for the further analysis of the distributed
heuristics is based on a multiagent Best-First Search (MA-BFS) derived from
the work in (Nissim and Brafman 2012), with deferred evaluation of (lazy)
heuristics. It is outlined in Algorithm 1. The MA-BFS algorithm is based on
the standard textbook BFS. First, the Open list is initialized with the
initial state $s_I$ and the Closed list is empty (line 1). In an infinite
loop (lines 2-15), the Open list is polled to obtain a state $s$ (line 4).
If $s$ has not been processed yet, it is marked as closed (lines 5, 6). If
$s$ is a goal state, the search is terminated and the plan is reconstructed
(lines 7-9); otherwise, the heuristic estimate of the state is computed
(line 10).

The MA-BFS differs in several aspects. Because the computation of a
heuristic estimate may invoke communication among multiple agents, the
heuristic estimators are asynchronous. The estimator is called with a
callback in its parameter and the main loop immediately continues (line 10).
When the heuristic estimation is finished, the callback (lines 16-23) is
invoked and it performs the standard procedure of setting the heuristic
value (line 17) and expanding the state using applicable actions (line 22).
In addition to that, if the state was obtained by expanding a public action,
the state is sent to all other agents (lines 19-21).


Algorithm 1  Multiagent Best-First Search
Input: Initial state $s_I$, goal $S_g \subseteq \mathcal{L}$, set of agent's
       actions $\alpha$, heuristic estimator $\mathcal{H}$.

 1: Open $\leftarrow \{s_I\}$, Closed $\leftarrow \emptyset$
 2: while true do
 3:   if Open $\neq \emptyset$ then
 4:     $s \leftarrow$ poll(Open)
 5:     if $s \notin$ Closed then
 6:       Closed $\leftarrow$ Closed $\cup \{s\}$
 7:       if $s$ unifies with $S_g$ then
 8:         reconstruct the plan
 9:       end if
10:       $\mathcal{H}(s$, heuristicComputedCallback$(s, h))$
11:     end if
12:   end if
13:   process heuristic messages
14:   process search messages
15: end while

16: heuristicComputedCallback$(s, h)$:
17:   set the heuristic estimate of $s$ to $h$
18:   if $h \neq \infty$ then
19:     if $s$ obtained by $a$ s.t. $a \in \alpha^{pub}$ then
20:       send state $s$
21:     end if
22:     Open $\leftarrow$ Open $\cup$ expand$(s, \alpha)$
23:   end if

Finally, at the end of each loop iteration (lines 13 and 14), the messages
related to the heuristic estimation (see next section) and the messages
containing states sent by other agents (originating from line 20) are
processed. The received states are added to the Open list. Note that in
MA-Strips a global state $s \subseteq \mathcal{L}$ is seen by the sending
agent $\alpha$ as a projection $s^\alpha$ and by the receiving agent $\beta$
as a projection $s^\beta$.
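
The callback-driven control flow of Algorithm 1 can be sketched for a single
agent as follows. This is a simplified illustration under stated
assumptions, not the MA-BFS implementation: message processing is omitted,
the heuristic invokes its callback synchronously (standing in for an
asynchronous reply), the search stops when the Open list is empty instead of
waiting for messages, and all names (`lazy_best_first`, `successors`,
`distance_heuristic`) are hypothetical.

```python
import heapq
import math
from itertools import count


def lazy_best_first(initial, is_goal, successors, heuristic):
    """Best-first search with deferred heuristic evaluation.

    `heuristic(state, callback)` must eventually call `callback(h)`.
    Successors are queued with the estimate of their parent state.
    """
    tie = count()  # tiebreaker so that states themselves are never compared
    open_list = [(0, next(tie), initial, None)]
    parent_of = {}
    closed = set()

    while open_list:
        _, _, state, parent = heapq.heappop(open_list)
        if state in closed:
            continue
        closed.add(state)
        parent_of[state] = parent
        if is_goal(state):
            plan = []
            while state is not None:
                plan.append(state)
                state = parent_of[state]
            return plan[::-1]

        def on_heuristic_computed(h, state=state):
            # Mirrors lines 16-23: dead ends (h = infinity) are not expanded.
            if h == math.inf:
                return
            for successor in successors(state):
                heapq.heappush(open_list, (h, next(tie), successor, state))

        heuristic(state, on_heuristic_computed)
    return None


# Toy state space on integers: from n one can reach n + 1 and 2 * n.
GOAL = 10


def successors(n):
    return [m for m in (n + 1, 2 * n) if m <= 20]


def distance_heuristic(n, callback):
    callback(abs(GOAL - n))


plan = lazy_best_first(1, lambda n: n == GOAL, successors, distance_heuristic)
```

Passing the continuation as a callback is what lets the main loop proceed
while an estimate is still outstanding; in the multiagent setting the
callback would additionally send the state to other agents when it was
reached by a public action.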

Distribution  of  Relaxation  Heuristics 

One general approach to computing heuristic estimates is to compute a
solution of a relaxed planning problem. Such problems have some constraints
removed (relaxed) in order to make them easier to solve. One well-known and
thoroughly studied relaxation is obtained by removing the delete effects of
all actions. Our motivation in this paper is to extend this concept to
multiagent planning, therefore we will focus on classical delete relaxation
heuristics: (i) inadmissible $h_{add}$, (ii) admissible $h_{max}$, both
introduced in (Bonet and Geffner 1999), and (iii) inadmissible $h_{FF}$,
which was presented in (Hoffmann and Nebel 2001). In the following
subsections, we will present efficient multiagent distributions of those
three heuristics.

Distribution of $h_{add}$ and $h_{max}$

Both additive and max heuristics work on a very similar principle and are
typically formalized as a set of recursive equations, such as the following
for $h_{add}$:

$$h_{add}(P, s) = \sum_{p \in P} h_{add}(p, s) \quad (1)$$

$$h_{add}(p, s) = \begin{cases} 0 & \text{if } p \in s \\ h_{add}(\arg\min_{a \in O(p)} [h_{add}(a, s)], s) & \text{otherwise} \end{cases} \quad (2)$$

$$h_{add}(a, s) = cost(a) + h_{add}(pre(a), s), \quad (3)$$

where $P$ is a set of propositions (i.e., goal or action preconditions), $s$
is a state, $a$ is an action and $O(p)$ is the set of actions which achieve
$p$, formally $O(p) = \{a \in \alpha \mid p \in add(a)\}$. The equations for
$h_{max}$ are the same except for Equation 1, where a max function is used
instead of the sum; therefore everything we state about $h_{add}$ applies
similarly to $h_{max}$.

In the multiagent setting, some of the actions in the argmin clause in
Equation 2, where we are choosing the minimal cost action among the actions
achieving the proposition $p$, may be projections of other agents' public
actions ($a \in \alpha \cup \alpha^{proj}$). Let $\alpha$ be the agent
currently computing the heuristic estimate of the state $s$ and $\beta$ be
some other agent. Let there exist, for some proposition $p$, an action
$a \in \beta$ s.t. $a^\alpha \in O(p)$. In such a case, there are two
options for handling the situation.

One option is to ignore the fact that the action is a projection and
continue as if it were an ordinary action. This way, we may leave out some
preconditions of the action (private to the owning agent), but we still get
a lower or equal estimate of the action cost (including the private
preconditions can only increase the cost). We will denote this approach as a
projected heuristic. Projected heuristics require no communication at all.

The other option is to always compute the true estimate. In order to do so,
the agent $\alpha$ sends a request $r = \langle a^\alpha, s \rangle$ to
agent $\beta$ to obtain the true estimate of the cost of the action
$a^\alpha$. Upon receiving the request, agent $\beta$ calls
$h_{add}(pre(a), s)$ and returns the result in a reply. In order to compute
the heuristic estimate, agent $\beta$ may need to send similar requests to
other agents, or even back to agent $\alpha$. This way, we end up with a
distributed recursive algorithm, which gives exactly the same results as a
centralized $h_{add}$ on a global problem
$\Pi^{G} = \langle \mathcal{L}, \bigcup_{\alpha \in \mathcal{A}} \alpha, s_0, S_g \rangle$,
since for every projection $a^\alpha$ of an action $a \in \beta$, the true
cost of $a$ is obtained from the agent $\beta$.

A middle ground between the presented two extremes is to limit the recursion
depth $\delta$. If the maximum recursion depth $\delta_{max}$ is reached,
all projected actions are evaluated without sending any further requests.
This effectively means introducing another relaxation of the original
problem, where the interaction between agents is limited: the agent coupling
relaxation. Such a heuristic estimate is always lower than or equal to the
heuristic estimate in the global problem using a centralized heuristic
estimator, because ignoring preconditions of an action in its projection can
never increase the cost of the action. By limiting the recursion depth to
$\delta_{max} = 0$, we return to the projected heuristic, where all
interactions between agents are relaxed away.


Algorithm 2  Distributed Relaxed Exploration
Input: Boolean flag $r$ (true when first called), global exploration
       queue $Q$

relaxedExploration$(r)$:
 1: while $Q \neq \emptyset$ do
 2:   $p \leftarrow$ poll$(Q)$
 3:   if $p \in$ goal and achieved$(p')$ : $\forall p' \in$ goal then
 4:     return
 5:   end if
 6:   $O_p \leftarrow \{a \in \alpha \cup \alpha^{proj} \mid p \in pre(a)\}$
 7:   for all $a \in O_p$ do
 8:     increment $cost(a)$ by $cost(p)$
 9:     if achieved$(p')$ : $\forall p' \in pre(a)$ then
10:       enqueueProposition$(a, eff(a), r)$
11:     end if
12:   end for
13: end while

Input: Action $a$, proposition $p$, Boolean flag $r$

enqueueProposition$(a, p, r)$:
14: if $cost(p) = \bot$ or $cost(p) > cost(a)$ then
15:   $cost(p) \leftarrow cost(a)$
16:   if $r$ and $a \in \alpha^{proj}$ then
17:     send request message $M_{req} = \langle s, a, 0 \rangle$
          to owner$(a)$,
          process the reply by
          receiveReplyEnqueueCallback$(p, \_)$
18:   else
19:     $Q \leftarrow Q \cup \{p\}$
20:   end if
21: end if

Input: Heuristic estimate $h$, proposition $p$ (set from
       enqueueProposition)

receiveReplyEnqueueCallback$(p, h)$:
22: if $cost(p) > h$ then
23:   $cost(p) \leftarrow h$
24:   $Q \leftarrow Q \cup \{p\}$
25: end if
26: relaxedExploration(false)
27: if no unresolved requests then
28:   return compute the total cost
29: end if
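
As a small worked illustration of why a projection can only lower the
estimate, consider a hypothetical action $a$ of agent $\beta$ with unit cost,
one public precondition $p$ that already holds in $s$, and one private
precondition $q$ with $h_{add}(q, s) = 2$ (all values are made up for
illustration). Applying Equation 3 to the projection $a^\alpha$, which does
not contain $q$, and to the full action gives:

```latex
\begin{align*}
h_{add}(a^\alpha, s) &= cost(a) + h_{add}(\{p\}, s) = 1 + 0 = 1 \\
h_{add}(a, s)        &= cost(a) + h_{add}(\{p, q\}, s) = 1 + (0 + 2) = 3
\end{align*}
```

With $\delta_{max} = 0$ agent $\alpha$ would work with the value 1, whereas a
request to agent $\beta$ would return the value 3 for the same action.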


Relaxed exploration. Although the definition of $h_{add}$ by a set of
recursive equations is intuitively clear and provides a good theoretical
background, in practice the recursive functions are not typically used.
Recursive calls are limited by the call stack, and converting such
recursion, where the recursive call is within a complex argmin function,
into iteration is possible but rather cumbersome. Instead, the idea of
relaxed exploration is typically utilized.

The relaxed exploration is in fact a reachability analysis of the relaxed
planning problem, which can be conveniently seen as building a relaxed
planning graph (RPG). A relaxed planning graph is a layered directed graph
with alternating fact and action layers. Its first layer contains all facts
which hold in the initial state, the next layer contains all actions whose
preconditions are satisfied in the previous layer (and noop actions), the
next layer contains all (add) effects of the actions from the previous
layer, and so forth. In practice, an RPG is not built explicitly, but the
exploration is achieved via an efficient representation we will refer to as
an exploration queue (based on the Fast-Downward planning system (Helmert
2006)).

The exploration queue considers only unary actions, i.e., actions which have
a single proposition as an add effect (any relaxed problem can be converted
so that it contains only unary actions). The exploration queue is supported
by a data structure representing the precondition-of and achieved-by
relations. The queue is initialized with the propositions which are true in
the state $s$. Until the queue is empty, a proposition $p$ is polled and it
is checked whether $p$ is a goal proposition and, if so, whether all goals
are satisfied. If not, for each action that depends on $p$ ($p \in pre(a)$
where $a \in \alpha \cup \alpha^{proj}$), the action cost is incremented by
the cost of proposition $p$ (that is, either added for $h_{add}$ or maxed
for $h_{max}$) and, if there are no more unsatisfied preconditions of the
action, the action is applied. The process is detailed in Algorithm 2,
lines 1-13. Thanks to the sole use of unary operators, the application of an
action $a$ can be interpreted as adding the (only one) proposition
$p = eff(a)$ to the exploration queue, thus the procedure is named
enqueueProposition (line 10).

The effectiveness of this approach lies in the fact that during the relaxed
exploration, cost estimates of facts and actions can be conveniently
computed, and once all goal facts are reached, the heuristic can be computed
as a simple sum or max of the costs of all goal facts, respectively.
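
A centralized version of this exploration can be sketched in a few lines of
Python. The sketch is a variant rather than a transcription of Algorithm 2:
it assumes non-negative action costs, uses a priority queue so that the cost
of a proposition is final when it is polled, recomputes the action cost from
its preconditions instead of incrementing it, and does not stop early when
all goals are reached. The function name and the action encoding are
hypothetical.

```python
import heapq
import math


def relaxed_estimate(state, goal, actions, combine=sum):
    """combine=sum gives h_add, combine=max gives h_max.

    `actions` is a list of (preconditions, effect, cost) triples with a single
    add effect each (unary actions); delete effects are already dropped.
    """
    if not goal:
        return 0
    cost = {p: 0 for p in state}
    queue = [(0, p) for p in state]
    heapq.heapify(queue)

    # precondition-of relation and per-action counters of open preconditions
    precondition_of = {}
    for index, (pre, _, _) in enumerate(actions):
        for p in pre:
            precondition_of.setdefault(p, []).append(index)
    unsatisfied = [len(pre) for pre, _, _ in actions]

    def enqueue_proposition(p, new_cost):
        if new_cost < cost.get(p, math.inf):
            cost[p] = new_cost
            heapq.heappush(queue, (new_cost, p))

    for pre, effect, action_cost in actions:
        if not pre:  # actions without preconditions are applicable at once
            enqueue_proposition(effect, action_cost)

    processed = set()
    while queue:
        _, p = heapq.heappop(queue)
        if p in processed:
            continue
        processed.add(p)
        for index in precondition_of.get(p, []):
            unsatisfied[index] -= 1
            if unsatisfied[index] == 0:
                pre, effect, action_cost = actions[index]
                enqueue_proposition(effect, action_cost + combine(cost[q] for q in pre))

    if any(g not in cost for g in goal):
        return math.inf
    return combine(cost[g] for g in goal)


# Toy problem: b and c are both needed for g, each action costs 1.
actions = [
    (frozenset({"a"}), "b", 1),
    (frozenset({"a"}), "c", 1),
    (frozenset({"b", "c"}), "g", 1),
]
h_add = relaxed_estimate({"a"}, {"g"}, actions, combine=sum)
h_max = relaxed_estimate({"a"}, {"g"}, actions, combine=max)
```

On the toy problem the additive estimate is 3 and the max estimate is 2,
which matches evaluating Equations 1-3 by hand with the sum and with the max
aggregation, respectively.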

Distributed relaxed exploration. An algorithm capable of building RPGs in a
distributed manner was presented in (Stolba and Komenda 2013). The major
drawback of that approach was the necessity to build the RPG for each state
by all agents at once, thus preventing the search from running independently
in parallel. It was shown that the resulting heuristic estimate is equal to
the centralized estimate. In this paper, we will not place the requirement
of obtaining the same value as in the centralized variant, which will allow
us to build a much more efficient algorithm. The algorithm is based on
building the exploration queue and sending requests to other agents when
projections of their actions are encountered. Moreover, the presented
algorithm allows for precise control of the recursion depth and thus enables
us to trade off the estimation precision against the computation and
communication complexity.


Algorithm 3  Request Processing
Input: Request message $M_{req} = \langle s, a, \delta \rangle$, where $s$
       is state, $a$ action, $\delta$ recursion depth, $\beta$ the sender

processRequest$(M_{req} = \langle s, a, \delta \rangle, \beta)$:
1: $Q \leftarrow \{s\}$
2: relaxedExploration(false)
3: $h \leftarrow$ compute the total cost
4: $P \leftarrow$ mark public actions
5: send reply message $M_{re} = \langle h, P, \delta \rangle$ to $\beta$


Algorithm 4  Reply Processing
Input: Reply message $M_{re} = \langle h, P, \delta \rangle$, where $h$ is
       heuristic estimate, $P$ set of actions, $\delta$ recursion depth

processReply$(M_{re} = \langle h, P, \delta \rangle)$:
 1: if $\delta < \delta_{max}$ then
 2:   $h_{sum} \leftarrow h$
 3:   for all $a \in P$ do
 4:     send request message $M_{req} = \langle s, a, \delta + 1 \rangle$
          to owner$(a)$,
          process the reply by
          receiveReplyCallback$(h)$
 5:   end for
 6: end if
 7: receiveReplyEnqueueCallback$(\_, h)$

Input: Heuristic estimate $h$

receiveReplyCallback$(h)$:
 8: $h_{sum} \leftarrow h_{sum} + h$
 9: if all replies received then
10:   receiveReplyEnqueueCallback$(\_, h_{sum})$
11: end if

The basic process of building the exploration queue $Q$ is similar to the
centralized version as described in the previous section. The main principle
of the distributed process is that whenever a projection of some other
agent's action should be applied (and its effect added to the queue), a
request is sent to the owner of the action to obtain its true cost. The
effect of the action is added to the queue only after the reply is received.
Note that when computing the reply, the agent may need to send requests as
well, thus ending up with a distributed recursion. In order to handle the
recursion effectively, it is flattened so that all requests are sent by the
initiator agent and the replies are augmented with the parameters of the
next recursive call.

The exploration part of the algorithm is shown in Algorithm 2, whereas
Algorithms 3 and 4 detail the inter-agent communication. The entry point of
the algorithm is the relaxedExploration procedure. First, it is invoked with
the $r$ parameter set to true, indicating that whenever a projected action
is encountered, a request is sent to its owner.

The main difference between the centralized and distributed approaches lies
in the enqueueProposition procedure. If the cost of the action improves the
current cost of the proposition, the cost of the proposition is set equal to
the cost of the action, as usual. However, if the action
$a \in \alpha^{proj}$ is a projection and sending of requests is enabled,
i.e., $r = true$, a request message
$M_{req} = \langle s, a, \delta \rangle$, where $s$ is the current state,
$a$ is the action and the initial recursion depth is $\delta = 0$, is sent
to the owner of the action $a$, i.e., agent $\beta$. Otherwise, the
proposition is added to the exploration queue.

Processing the messages. When the request message is received by the agent
$\beta$ (see Algorithm 3, processRequest), the relaxed exploration is run
with the goal being the preconditions of the requested action $a$ and
without sending any requests, i.e., $r = false$. After finishing the
exploration, public actions which have contributed to the resulting
heuristic estimate are determined (line 4). In principle, the procedure is
similar to extracting a relaxed plan in the FF heuristic. A reply
$M_{re} = \langle h, P, \delta \rangle$ is sent, where $h$ is the computed
heuristic value, $P$ is the set of the contributing public actions and
$\delta$ is the current recursion depth.

Receiving the reply from agent $\beta$ is managed by the procedure
processReply in Algorithm 4. If the recursion depth has already reached the
limit, $\delta \geq \delta_{max}$, the original
receiveReplyEnqueueCallback$(p, h)$ from Algorithm 2 for action $a$ is
called, the cost estimate of proposition $p$ is finalized and $p$ is added
to the exploration queue. Since the messaging process is asynchronous, the
original relaxed exploration has already terminated, therefore it is started
again (Algorithm 2, line 26) with the original data structures and with the
newly evaluated proposition added to the queue. When the exploration is
finished and there are no pending requests, the final heuristic estimate is
computed depending on the actual heuristic (sum or max) and is returned via
a callback to the search, so that the evaluated state can be expanded.

Otherwise, if $\delta < \delta_{max}$, agent $\alpha$ iterates through all
actions $a' \in P$ and sends requests to their respective owners. The
heuristic estimate received in each reply is added to the shared $h_{sum}$.
When all replies are received (the replies undergo the same procedure if
there are any other public actions involved) and all costs are added
together, the receiveReplyEnqueueCallback$(p, h)$ from Algorithm 2 is again
called with $h = h_{sum}$.

The processReply procedure stands for the distributed recursion, but the
deeper recursive call is not made by the agent $\beta$; instead, the
parameters of the recursion (the set of actions $P$ which should be resolved
next) are sent back to the initiator agent. This is an optimization which
removes the need for multiple heuristic evaluation contexts to handle
multiple interwoven request/reply traces. Each context would need a separate
instance of the exploration queue data structure, which would be a major
inefficiency. Instead, the initiator agent is responsible for tracking the
recursion and the replying agent only computes one reply at a time, locally,
without sending any requests. Therefore, each agent needs only two instances
of the exploration queue: one used to compute its own heuristic estimates
(and possibly send requests and await replies), and one used to compute the
local estimates for the replies.
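
The flattened recursion can be sketched from the initiator's point of view as
a work list. The following Python fragment is only an illustration under
several assumptions: a synchronous function call stands in for each
asynchronous request/reply round trip, at least one request is sent (i.e.,
$\delta_{max} \geq 1$, since $\delta_{max} = 0$ corresponds to the projected
heuristic without communication), the request chain is acyclic, and the
names `resolve_projected_action`, `owners`, `beta` and `gamma` as well as the
numeric estimates are made up.

```python
import math
from collections import deque


def resolve_projected_action(action, state, owners, delta_max):
    """Return (h_sum, number of requests) for one projected action.

    Actions are (owner_name, action_name) pairs. `owners[name](state, action)`
    is a hypothetical stand-in for one request/reply round trip; it returns
    (h, P): the owner's local estimate and the contributing public actions.
    """
    h_sum = 0
    requests = 0
    pending = deque([(action, 0)])  # (action, recursion depth of its request)
    while pending:
        current, delta = pending.popleft()
        owner_name = current[0]
        h, contributing = owners[owner_name](state, current)
        requests += 1
        h_sum += h
        if delta < delta_max:  # otherwise the deeper actions stay projected
            for public_action in contributing:
                pending.append((public_action, delta + 1))
    return h_sum, requests


# Made-up local estimates forming a chain unload -> refuel -> pay.
def beta(state, action):
    table = {"unload": (3, [("gamma", "refuel")]), "pay": (4, [])}
    return table[action[1]]


def gamma(state, action):
    return (2, [("beta", "pay")])


owners = {"beta": beta, "gamma": gamma}
shallow = resolve_projected_action(("beta", "unload"), frozenset(), owners, 1)
deep = resolve_projected_action(("beta", "unload"), frozenset(), owners, math.inf)
```

Because only the initiator keeps the pending list and the running sum, the
owners answer each request statelessly, which corresponds to the two
exploration-queue instances per agent mentioned above. With the depth limit
of 1 the last action in the chain is never requested, so the estimate is
lower and one message round trip is saved.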

Distribution of $h_{FF}$

The Fast-Forward heuristic $h_{FF}$ is not directly based on estimating the
cost of actions in the relaxed problem, but on actually finding a plan
solving the relaxed problem (a relaxed plan, or RP). The heuristic is not
typically described using recursive equations, but the implementation based
on relaxed exploration can be easily reused. The difference is that the
evaluation does not end when the exploration is finished (all goal
propositions have been reached), but continues with the relaxed plan
extraction. The extraction of the RP starts with the goal propositions and
traverses the data structure towards the initial state, while marking the
relaxed plan.
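
A minimal centralized sketch of exploration followed by relaxed plan
extraction is given below. It is an illustration, not the implementation
evaluated in this paper: it assumes non-negative action costs, chooses the
supporter of each proposition by its additive cost (a common choice, assumed
here), replaces the exploration queue with a naive fixpoint loop, and uses
hypothetical names throughout.

```python
import math


def ff_relaxed_plan(state, goal, actions):
    """Return the set of action names of a relaxed plan, or None if unreachable.

    `actions` maps a name to (preconditions, add effects, cost); delete
    effects are ignored, which is exactly the delete relaxation.
    """
    cost = {p: 0 for p in state}
    supporter = {}  # cheapest known achiever of each derived proposition
    changed = True
    while changed:  # naive fixpoint; an exploration queue does this faster
        changed = False
        for name, (pre, add, action_cost) in actions.items():
            if all(p in cost for p in pre):
                total = action_cost + sum(cost[p] for p in pre)
                for q in add:
                    if total < cost.get(q, math.inf):
                        cost[q] = total
                        supporter[q] = name
                        changed = True
    if any(g not in cost for g in goal):
        return None

    # Extraction: walk back from the goals along the recorded supporters.
    plan, seen, stack = set(), set(), list(goal)
    while stack:
        p = stack.pop()
        if p in seen or p in state:
            continue
        seen.add(p)
        plan.add(supporter[p])
        stack.extend(actions[supporter[p]][0])
    return plan


# Toy problem: both goals c and d need b, which is produced by one action.
actions = {
    "make_b": (frozenset({"a"}), frozenset({"b"}), 1),
    "make_c": (frozenset({"b"}), frozenset({"c"}), 1),
    "make_d": (frozenset({"b"}), frozenset({"d"}), 1),
}
relaxed_plan = ff_relaxed_plan({"a"}, {"c", "d"}, actions)
h_ff = len(relaxed_plan)
```

Because the plan is collected as a set, `make_b` is counted once even though
both goals depend on it, so the estimate is 3, while the additive costs of
`c` and `d` (2 each) would sum to 4.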

Since the algorithm is implementation-wise very similar to the $h_{add}$
heuristic, one of the possible approaches to the distribution of $h_{FF}$ is
to perform the distributed relaxed exploration exactly as in $h_{add}$ and
simply add an RP extraction routine at the end of the heuristic evaluation
(as part of the total cost computation). Another approach was introduced in
(Stolba and Komenda 2013) as the lazy multiagent FF heuristic $h_{lazyFF}$,
which we have adopted and compared with the previously described approach
and with both the additive and max heuristics.

Our version of the lazy FF algorithm starts with local building of the
exploration queue. When all goal propositions are reached, a relaxed plan
$\pi_l$ is extracted. For all actions $a \in \pi_l$ which are projections,
$a \in \alpha^{proj}$, a request message
$M_{req} = \langle s, a, \delta \rangle$ is sent to the owning agent
$\beta = owner(a)$ of the action $a$. When agent $\beta$ receives the
request, it constructs a local relaxed plan from state $s$ (by local relaxed
exploration and local RP extraction without sending any requests),
satisfying the preconditions (both public and private) of the action $a$.
Then, agent $\beta$ sends a reply $M_{re} = \langle h, P, \delta \rangle$,
where $h$ is the length of the relaxed plan and $P$ is a set of projected
actions contained in the plan. When the reply is received by agent $\alpha$,
the algorithm iterates through all actions $a' \in P$ and sends requests to
their respective owners. Each of the requests undergoes the same procedure
as the original request, adding the returned heuristic estimates to the
resulting $h_{sum}$. When all requests are processed, $h_{sum}$ is added to
the length of the local relaxed plan of agent $\alpha$ and returned via
callback as the heuristic estimate of state $s$.


                 |          h_FF           |          h_add          |          h_max          |        h_lazyFF
prob. (|A|) δmax |    t[s]    e[k]   b[MB] |    t[s]    e[k]   b[MB] |    t[s]    e[k]   b[MB] |    t[s]    e[k]   b[MB]
-----------------+-------------------------+-------------------------+-------------------------+------------------------
Rov8 (4)    1    |     1.2     0.4     0.5 |     1.2     0.5     0.7 |     1.1     0.4     0.5 |       -       -       -
Rov8 (4)    ∞    |     1.1     0.5     0.7 |     1.2     0.6     0.8 |     1.1     0.5     0.6 |       -       -       -
Rov12 (4)   0    |    70.9  4617.3    35.7 |    40.7  3939.5    31.2 |       -       -       - |      57  4583.5    35.6
Rov12 (4)   1    |     1.2     0.8     0.7 |     1.3     0.3     0.2 |     1.1     0.3     0.3 |       -       -       -
Rov12 (4)   ∞    |     1.1     0.4     0.4 |     1.2     0.7     0.6 |     1.2     0.3     0.3 |       -       -       -
Rov14 (4)   1    |      21   112.1   157.4 |    18.7   101.9   136.6 |    21.9   127.2   170.6 |       -       -       -
Rov14 (4)   ∞    |      23   124.7   175.1 |    19.2    98.5   138.7 |    16.1    90.6   127.6 |       -       -       -
Sat9 (5)    1    |     3.2     6.2    14.7 |     3.2     7.8    17.7 |     3.1     6.9    15.6 |     3.1     9.8    10.3
Sat9 (5)    ∞    |     3.7    10.6      25 |     3.5     7.4    17.4 |     3.3     6.2    14.6 |       3     7.4     8.4
Sat10 (5)   1    |     3.7     4.1     9.6 |     3.4     3.2     7.1 |     3.4     2.4     5.3 |     4.1    13.5     6.6
Sat10 (5)   ∞    |     3.7     4.3     9.9 |       4     8.2      19 |     3.3     2.3     5.3 |     3.9     7.8       6
Sat* (14)   1    |    69.2     9.3    36.9 |      68     8.7    33.4 |      69       9    34.8 |    60.6       9    17.5
Sat* (14)   ∞    |      69     9.3    36.7 |    68.5     9.3      37 |    68.5       9    35.7 |    61.3     9.7    18.4
Sat* (16)   1    |   133.8    11.1    57.5 |   136.6    11.7    59.4 |   133.6    10.7    54.1 |   126.4    12.8      31
Sat* (16)   ∞    |   132.7    11.1    57.8 |   137.3    12.4    64.6 |   133.5    11.1    57.8 |   124.5    12.3    28.8
Log* (6)    0    |     0.7     6.8       0 |     0.7     5.7       0 |     0.7       7       0 |     0.7     7.2       0
Log* (6)    1    |     1.2     0.5     0.3 |     1.6     1.3     0.6 |     1.2     0.7     0.3 |     1.6     5.4     0.6
Log* (6)    ∞    |     1.1     0.4     0.2 |     1.3     0.8     0.5 |     1.2     0.5     0.3 |     1.4     0.6     0.8
CP* (6)     0    |     1.8   136.1     2.2 |     1.6    99.5     1.7 |     2.6   351.5     5.5 |     1.8   137.7     2.2
CP* (6)     1    |     1.7   127.5     2.1 |     7.3    72.1    51.6 |      10   100.2    71.8 |     5.6    43.8    66.1
CP* (6)     ∞    |     1.7   122.5       2 |     1.4    76.2     1.3 |     2.6   352.6     5.4 |    44.3    91.8   973.5
CP* (7)     0    |     2.2   183.2     2.4 |     2.2   223.1     3.2 |     6.3  1252.5    15.7 |     2.3     205     2.7
CP* (7)     1    |     1.9   162.1     2.2 |    18.3   248.1   150.8 |    35.5   451.4   274.5 |    50.4   371.2   738.6
CP* (7)     ∞    |       2   188.9     2.5 |     2.1   225.2     3.2 |     6.3  1255.6    15.5 |   160.9   249.5   249.5
Sok* (2)    0    |     1.6     8.5     0.5 |     1.5     7.7     0.5 |     1.7    11.7     0.7 |     1.6     8.8     0.6
Sok* (2)    1    |     1.5     7.6     0.5 |    17.1      22    66.8 |     3.9     4.3    12.1 |     4.3    12.9    24.1
Sok* (2)    ∞    |     1.5     8.3     0.5 |     1.4     7.8     0.5 |     1.6    11.7     0.7 |       -       -       -

Table 1: Comparison of all presented heuristics for selected problems and
metrics. The planning time t is in seconds, the number of explored states e
is in thousands of states and the communicated information b is in
megabytes. Abbreviations: Rov = rovers, Sat = satellites, Log = logistics,
CP = cooperative path-finding, Sok = sokoban.


The recursion depth of the heuristic estimate can be limited in a similar
manner as in the add/max heuristics. Whenever a request should be sent and
the maximum recursion limit $\delta_{max}$ has been reached, the request is
not sent and the possible relaxed sub-plan is ignored.

Experiments 

To analyze the properties of the proposed distributed heuristics and their
implementations, we have prepared a set of experiments covering various
efficiency aspects of the heuristics.

All experiments were performed on an FX-8150 8-core processor at 3.6 GHz,
with each run limited to 8 GB of RAM and 10 minutes. Each measurement is a
mean of 5 runs. We used the translation to the SAS+ formalism and the
preprocessing from the Fast-Downward planning system (Helmert 2006) and a
new implementation of the search and heuristic estimators.

Initial  comparison 

The first batch of experiments focused on two classical planning metrics
used in comparisons of heuristic efficiency: planning time $t$ and number of
explored states $e$. Those metrics were supplemented by a multiagent metric
of communicated bytes $b$ among the agents during the planning process. The
planning problems used stem from IPC domains modified for multiagent
planning as presented, e.g., in (Nissim and Brafman 2012). The problems with
* in their names were either based on IPC domains, but simplified, or taken
from other state-of-the-art multiagent benchmarks, e.g., from (Komenda,
Novak, and Pechoucek 2013). The recursion depth was limited to three values
$\delta_{max} \in \{0, 1, \infty\}$, as other settings of $\delta_{max}$
showed similar results. Missing rows were not successfully planned with any
of the tested heuristics.

The results are summarized in Table 1. No single combination of heuristic
and $\delta_{max}$ dominates the others. In Rovers, the most successful
seems to be $h_{max}$. In Satellites, $h_{lazyFF}$ performs well, but it
loses in other domains, where it does not solve some problems at all.
Logistics is dominated by $h_{FF}$, and in Cooperative Path-Finding and
Sokoban, the best are $h_{FF}$ and $h_{add}$.

Problem  coverage 

In this experiment, we have evaluated the coverage of all the described
heuristics ($h_{add}$, $h_{max}$, $h_{FF}$ and $h_{lazyFF}$) with the
maximum recursion depth $\delta_{max}$ set to 0, 1, 2, 4 and $\infty$. The
coverage has been evaluated over two sets of benchmarks. The first set
consists of 40 specifically multiagent problems, which are typically not
that combinatorially hard, but contain more agents (taken from (Stolba and
Komenda 2013) and (Komenda, Novak, and Pechoucek 2013)). The second set
consists of 21 problems converted directly from IPC benchmarks (as in
(Nissim and Brafman 2012)), which are typically combinatorially much harder,
but with fewer agents. The results are summarized in Table 2.


δmax       0            1            2            4            ∞
h_FF       35 / 7       38 / 15.4    38 / 15      38 / 14.4    38 / 15
h_add      35 / 7       38 / 14      38 / 14      38 / 14      38 / 14
h_max      35 / 3.2     38 / 14      38 / 14      38 / 14      38 / 14
h_lazyFF   35.2 / 6.8   38 / 8       36.2 / 8     36.5 / 8     36.8 / 8

Table 2: Coverage for various heuristics and recursion depths δmax. The
results are in the form multiagent domains / IPC domains.

The results show a clear dominance of hFF, but interestingly the other
distribution approach of the Fast-Forward heuristic, hlazyFF, is at the
other end of the spectrum. This is most probably because one of the
biggest strengths of the FF heuristic, compared to the other
delete-relaxation heuristics used here, is that it does not suffer from
over-counting (one action being included in the estimate several times)
thanks to the explicit relaxed plan extraction. In hlazyFF, we partially
lose this advantage, because a reply carries only the length of the
plan. Therefore, a single action can be counted several times in
multiple replies from a single agent, or even from multiple agents.
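
The over-counting effect can be illustrated with a minimal sketch. The two replies and the action names below are hypothetical and are not taken from the experiments. The sketch only contrasts summing reply lengths (all that a length-only reply allows) with counting distinct actions (what an explicit relaxed plan extraction allows).

```python
# Hypothetical relaxed-plan fragments answering two separate requests.
reply_1 = {"load-t1-A", "drive-t1-A-B", "unload-t1-B"}
reply_2 = {"load-t1-A", "drive-t1-A-B", "unload-t1-B", "fly-a-B-C"}


def estimate_from_lengths(replies):
    """Only plan lengths are known, so they can only be summed."""
    return sum(len(reply) for reply in replies)


def estimate_from_actions(replies):
    """The actions themselves are known, so shared actions count once."""
    return len(set().union(*replies))


lazy_estimate = estimate_from_lengths([reply_1, reply_2])
explicit_estimate = estimate_from_actions([reply_1, reply_2])
print(lazy_estimate, explicit_estimate)
```

In this toy case the three truck actions are shared by both replies, so the length-based estimate counts them twice.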

Another message the table conveys is that the setting δmax = 0 is
dominated by the other values. This may be due to the choice of the
domains; the effect of various δmax settings is analyzed thoroughly in
the next set of experiments. Also, for δmax > 0 the particular value of
δmax affects the coverage only marginally.

We can also state that, in terms of coverage, the use of distributed
heuristics compares favorably to the state of the art: in (Nissim and
Brafman 2012), the planner using the projected heuristics LM-cut and
Merge&Shrink has a coverage of 11 and 12 problems, respectively, which
was exceeded by all distributed heuristics with δmax > 0 except for
hlazyFF. It is important to emphasize, however, that the comparison is
not completely fair, as the approach used in (Nissim and Brafman 2012)
is optimal, whereas our approach is not optimal even though we use the
admissible hmax heuristic. Achieving optimality requires special
handling of the search termination, as detailed in (Nissim and Brafman
2012).

Effect  of  the  recursion  depth 

In the following set of experiments, we evaluated the effect of changing
the maximal recursion depth δmax on the speed and communication
requirements of the planning process. The data set was measured on four
selected domains with varied coupling (rovers, satellites, cooperative
path-finding and logistics), each represented by a single problem. The
maximal recursion depth ranged from 0 (a projected heuristic) to 9; for
comparison, the results were normalized against the result of the run
with δmax = ∞.

Figure 1: Planning time normalized to the result for δmax = ∞ for the
hlazyFF heuristic.

By coupling, we understand the concept formalized in (Brafman and
Domshlak 2008), which can be rephrased as "the more interactions must
take place among the agents in order to solve the problem, the more
coupled the problem is". At one extreme, there are fully coupled
problems, where all actions interact with other agents (the problems
contain only public actions). At the other extreme, the agents can solve
their individual problems without any interaction. Because of our
decision to treat all goals as public, we cannot achieve full
decoupling: at least the goal-achieving actions are public and thus
cause some level of coupling. The experimental domains were chosen such
that rovers and satellites are loosely coupled. In satellites, only the
assumption that all goal-achieving actions are public introduces some
coupling; in rovers, there are also interacting preconditions among the
goal-achieving actions. Logistics is quite balanced (private movement of
agents and public handling of packages) and cooperative path-finding is
fully coupled.

The experimental results for the hlazyFF heuristic are plotted in
Figures 1 and 2. In the fully coupled cooperative path-finding, the
results are best for δmax = 0 and converge to the results for δmax = ∞
as δmax grows. This is because in a fully coupled problem all actions
are public, and in cooperative path-finding all their preconditions and
effects are public as well (which does not always have to be the case).
Therefore, each agent has complete information about the problem in the
form of the action projections (a^α = a for all actions and agents), and
the projected heuristic gives a perfect estimate (the same as a global
heuristic would give). For δmax > 0, requests are sent for every
projected action, causing more communication and computation without
bringing any improvement to the heuristic estimate.

Figure 2: Communicated bytes and heuristic message requests normalized
to δmax = ∞ for the hlazyFF heuristic.

The results for the loosely coupled problems give a completely different
picture. They are significantly worse for δmax = 0, and from δmax = 1 on
they are practically equal to those for δmax = ∞. The solutions of those
problems typically consist of long private parts finished by a single
public action (the goal-achieving action). When estimated by a projected
heuristic, the private parts of other agents are ignored and the
estimates are thus much less informative. Even the fact that a state
expanded by a public action is sent together with the original agent's
heuristic estimate does not help, because the estimates of states
expanded further from such a state ignore the information again. Even
δmax = 1, however, is enough to resolve this issue.

Lastly, in the balanced logistics problem, the δmax = 0 estimates are
rather good (but not as good as in cooperative path-finding) and with
growing δmax the results converge towards those for δmax = ∞, but for
0 < δmax < ∞ the results are slightly worse. This may suggest that, as
the coupling is balanced, it is best either to fully exploit the coupled
part of the problem and use projected heuristics, or to rely on the
decoupled part of the problem and employ the full recursion approach,
depending on the exact balance.

The results for communication are in Figure 2. The left chart compares
the total bytes communicated and shows the same tendencies as the
planning time in Figure 1. In fact, limiting the interactions may lead
to increased communication. The right chart shows the data for heuristic
requests; there we see the expected result for δmax = 0, where no
requests are sent, but otherwise the tendencies are surprisingly
similar. This indicates that the communication complexity is dominated
by the search communication complexity (the longer the search takes, the
more messages are passed).


These rather unintuitive results suggest that for tightly coupled
problems, sharing of information is not only less important, because the
agents have most of the information in their problem projections, but
may even lower the effectiveness because of redundant computation. On
the other hand, for loosely coupled problems, communication is vital,
even if it is very limited. For balanced problems, both extremes are
equally good. In general, it is hard to determine which approach will
yield the best results, but it is sensible to choose among no
communication (δmax = 0), full communication (δmax = ∞), and
communication limited to a very low recursion depth, i.e., δmax = 1. If
we can expect some properties of the problems at hand, a preferred
approach is much easier to suggest: if we do not expect loosely coupled
problems, δmax = 0 is the best choice; if we do not expect tightly
coupled problems, δmax = ∞; and if we do not expect balanced problems,
δmax = 1 seems to be the best choice.
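
The selection rule from the preceding paragraph can be written down as a small lookup. This is only a sketch: the function name, the class labels and the fallback for the case without prior knowledge are illustrative assumptions, not part of the evaluated planner.

```python
import math

# Suggested recursion limit when one class of problems can be ruled out.
DELTA_MAX_WHEN_EXCLUDED = {
    "loosely coupled": 0,
    "tightly coupled": math.inf,
    "balanced": 1,
}


def suggest_delta_max(excluded_class=None):
    """Return the candidate recursion limits for the given prior knowledge."""
    if excluded_class is None:
        # Nothing is known in advance: all three settings stay candidates.
        return [0, 1, math.inf]
    return [DELTA_MAX_WHEN_EXCLUDED[excluded_class]]


print(suggest_delta_max("balanced"))
```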

The results in the presented figures are for hlazyFF mainly because they
are the most illustrative; the other heuristics follow the same patterns
as described here.

Final  Remarks 

We have proposed an efficient distribution approach for three classical
delete-relaxation heuristics: hadd, hmax and hFF. The heuristics were
experimentally compared in a planner utilizing a multiagent Best-First
Search on various multiagent planning problems stemming from classical
IPC domains. The comparison comprised the metrics of planning time,
communication and expanded states, a coverage comparison, and a
comparison of the effect of changing the maximum recursion depth.

The results did not show any single heuristic to dominate the others,
but brought to light an interesting and rather unintuitive conclusion:
for tightly coupled problems, it is more efficient to limit the
information sharing, whereas loosely coupled problems strongly benefit
from the distributed heuristic estimation.
Abstract

The use of heuristics in search-based domain-independent deterministic
multiagent planning is as important as in classical planning. In this
work, we propose a formal and an algorithmic adaptation of the
well-known Fast-Forward heuristic to multiagent planning. Such a
treatment is important as it addresses the challenges in the
decentralization of this and other heuristics based on a relaxation of
the original planning problem. Such decentralization enables global
heuristic estimates to be computed without exposing local information.
Additionally, since the Fast-Forward heuristic is based on relaxed
planning, we propose a multiagent approach for building factored relaxed
planning graphs among the agents. We sketch proofs that the distributed
version of the algorithm gives the same results as the centralized
version. Finally, we experimentally validate different distribution
strategies of the heuristic estimate.

Introduction 

In recent years, the landscape of multiagent planning research has been
changed by Brafman and Domshlak's formal treatment and promising
complexity results of domain-independent deterministic multiagent
planning (DMAP) (Brafman and Domshlak 2008), represented as an extension
of STRIPS for more agents. An important piece of the puzzle was a
decomposition of a planning problem common to all the agents. In
principle, the underlying ideas relate to the research on planning
problem factorization and its utilization for more efficient solving of
classical planning problems. Therefore, even for cooperative agents, it
is reasonable to hide parts of the information used during planning from
other agents, as this helps (in loosely coupled problems) the agents to
focus only on their parts of the problem.

After this publication, the community started to design and implement
the first planners using the principles of DMAP described in Brafman and
Domshlak's paper. The first one, by Nissim et al. (Nissim, Brafman, and
Domshlak 2010), was built on a distributed constraint satisfaction
problem solver and a forward-chaining planner. This approach precisely
followed the ideas in (Brafman and Domshlak 2008), but exposed a couple
of issues that made it not comparable in efficiency with current
state-of-the-art implementations of classical planners. One of the
issues was poor scalability with the growing length of the coordination
part of the resulting plans. An improvement of scalability was proposed
in (Nissim and Brafman 2012) by leaving the DisCSP+Planning approach and
moving to the principle which is currently the most successful in
classical planning: A* or variations on Best-First Search (BFS) with
highly informed, automatically derived heuristics.

Since the motivation of (Nissim and Brafman 2012) was to propose an
optimal planner (MA-A*), the heuristics used were LM-cut (Helmert and
Domshlak 2009) with the pathmax equation and merge-and-shrink (Helmert,
Haslum, and Hoffmann 2007). In the distributed search approach, the
heuristics were used only with the local information of the respective
agent, i.e., with its internal actions, its public actions and
projections of other agents' public actions. In the discussion of
(Nissim and Brafman 2012), the authors state that "the greatest
practical challenge [...] is that of computing a global heuristic by a
distributed system", which is precisely our focus in this work. To our
knowledge, there is no work proposing efficient planners for DMAP that
are not focused on optimality of the resulting plans. In the field of
classical planning, on the other hand, the best performing planners,
such as Fast Downward and LAMA, incorporate a fast but suboptimal search
algorithm using non-admissible heuristics.

In this work, we propose a formal and algorithmic adaptation of the
well-known relaxation heuristic Fast-Forward, hFF (Hoffmann and Nebel
2001), to multiagent planning. We argue that such a treatment is
important as it demonstrates the algorithmic challenges in
decentralizing the computation of hFF and other related heuristics.
Additionally, since the hFF heuristic is based on relaxed planning, we
propose a multiagent (MA) approach for building factored relaxed
planning graphs among the agents. We sketch proofs that the distributed
version of the algorithm gives the same results as the centralized
version. Finally, we experimentally validate two distribution strategies
of the heuristic estimate against the local estimate.


Multiagent  Planning 


We consider a number of cooperative and coordinated agents featuring
distinct sets of capabilities (actions), which concurrently plan and
execute their local plans in order to achieve a joint goal. The world
wherein the agents act is classical and the actions are deterministic.
The following formal preliminaries compactly restate the MA-STRIPS
problem (Brafman and Domshlak 2008) as required for the following
sections.

An MA-STRIPS planning problem is a quadruple
$\Pi = \langle \mathcal{L}, \mathcal{A}, s_0, S_g \rangle$, where
$\mathcal{L}$ is a set of propositions, $\mathcal{A}$ is a set of agents
$\alpha_1, \ldots, \alpha_{|\mathcal{A}|}$, $s_0$ is an initial state
and $S_g$ is a set of goal states. A state $s \subseteq \mathcal{L}$ is
a set of atoms from a finite set of propositions
$\mathcal{L} = \{p_1, \ldots, p_m\}$ which hold in $s$. An action an
agent can perform is a tuple
$a = \langle \mathrm{pre}(a), \mathrm{add}(a), \mathrm{del}(a) \rangle$,
where $a$ is a unique action label and $\mathrm{pre}(a)$,
$\mathrm{add}(a)$, $\mathrm{del}(a)$ respectively denote the sets of
preconditions, add effects and delete effects of $a$, taken from
$\mathcal{L}$. $Act$ denotes the set of all actions in the multiagent
planning problem $\Pi$, i.e., $Act = \bigcup_{\alpha \in \mathcal{A}} \alpha$.

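An agent $\alpha = \{a_1, \ldots, a_n\}$ is characterized precisely by
its capabilities, a finite repertoire of actions $a_i \in Act$ it can
perform in the environment. MA-STRIPS problems distinguish between
public and internal facts and actions. Let
$\mathrm{atoms}(a) = \mathrm{pre}(a) \cup \mathrm{add}(a) \cup \mathrm{del}(a)$
and similarly $\mathrm{atoms}(\alpha) = \bigcup_{a \in \alpha} \mathrm{atoms}(a)$.
The $\alpha$-internal and public subsets of all facts $\mathcal{L}$ are
denoted $\mathcal{L}^{\alpha\text{-int}}$ and $\mathcal{L}^{pub}$
respectively, where
$\mathcal{L}^{\alpha\text{-int}} = \mathrm{atoms}(\alpha) \setminus \bigcup_{\beta \in \mathcal{A} \setminus \alpha} \mathrm{atoms}(\beta)$
and $\mathcal{L}^{pub} = \mathrm{atoms}(\alpha) \setminus \mathcal{L}^{\alpha\text{-int}}$.
The facts relevant for one agent $\alpha$ are denoted
$\mathcal{L}^{\alpha} = \mathcal{L}^{\alpha\text{-int}} \cup \mathcal{L}^{pub}$,
and the projection $s^{\alpha}$ of a state $s$ to an agent $\alpha$ is
the subset of the global state $s$ containing only public facts and
$\alpha$-internal facts, formally $s^{\alpha} = s \cap \mathcal{L}^{\alpha}$.
The set of public actions of agent $\alpha$ is defined as
$\alpha^{pub} = \{a \mid a \in \alpha, \mathrm{atoms}(a) \cap \mathcal{L}^{pub} \neq \emptyset\}$
and the set of internal actions as $\alpha^{int} = \alpha \setminus \alpha^{pub}$.
The symbol $a^{\alpha}$ denotes the projection of an action
$a \in \beta$, $\beta \neq \alpha$, for agent $\alpha$, i.e., the action
stripped of all other agents' propositions, formally
$\mathrm{atoms}(a^{\alpha}) = \mathrm{atoms}(a) \cap \mathcal{L}^{\alpha}$.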
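
The definitions above can be sketched in a few lines of Python. The two-agent fragment of the logistics domain below is a hypothetical illustration, and all function and type names are assumptions made for this sketch only. The public facts are computed per agent, following the definition in the text.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def act(name, pre, add, delete=()):
    return Action(name, frozenset(pre), frozenset(add), frozenset(delete))


def atoms(action):
    return action.pre | action.add | action.delete


def agent_atoms(actions):
    return frozenset().union(*(atoms(a) for a in actions))


def internal_facts(agents, alpha):
    # Facts of alpha that no other agent mentions.
    others = frozenset().union(
        *(agent_atoms(acts) for name, acts in agents.items() if name != alpha))
    return agent_atoms(agents[alpha]) - others


def public_facts(agents, alpha):
    return agent_atoms(agents[alpha]) - internal_facts(agents, alpha)


def public_actions(agents, alpha):
    pub = public_facts(agents, alpha)
    return [a.name for a in agents[alpha] if atoms(a) & pub]


def project(action, agents, alpha):
    # Strip every proposition that is not among the facts of agent alpha.
    keep = agent_atoms(agents[alpha])
    return Action(action.name, action.pre & keep,
                  action.add & keep, action.delete & keep)


unload_t1_b = act("unload-t1-B", {"at-t1-B", "in-p-t1"}, {"at-p-B"}, {"in-p-t1"})
agents = {
    "t1": [
        act("load-t1-A", {"at-t1-A", "at-p-A"}, {"in-p-t1"}, {"at-p-A"}),
        act("drive-t1-A-B", {"at-t1-A"}, {"at-t1-B"}, {"at-t1-A"}),
        unload_t1_b,
    ],
    "a": [
        act("load-a-B", {"at-a-B", "at-p-B"}, {"in-p-a"}, {"at-p-B"}),
        act("fly-a-B-C", {"at-a-B"}, {"at-a-C"}, {"at-a-B"}),
    ],
}
projected = project(unload_t1_b, agents, "a")
print(public_actions(agents, "t1"), projected)
```

In this fragment only at-p-B is shared, so the airplane sees the truck's unload action as an action without preconditions that adds at-p-B.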

Note that all actions of an agent $\alpha$ use only the agent's facts,
formally $\forall a \in \alpha : \mathrm{atoms}(a) \subseteq \mathcal{L}^{\alpha}$,
by definition in (Brafman and Domshlak 2008). The goal set $S_g$ of a
multiagent planning problem will be treated as public (Nissim and
Brafman 2012), therefore all goal-achieving actions are public. In the
following sections, we assume MA-A* from (Nissim and Brafman 2012) as
the algorithm for multiagent planning, but with a novel distribution of
the hFF heuristic.

As a running example, we will use a simple logistics problem (see
Figure 1) in a multiagent setting. There are two cities, each with two
locations (A, B and C, D), and one package p. A and D represent depots
and B, C airports. Three agents represent two cargo trucks t1, t2
(moving only within the cities) and one airplane a (moving only between
airports B and C). The goal is to transport the package from depot A to
the other depot D.


Figure 1: The running example, an instance of the LOGISTICS problem with
three agents and one package.


Agent  Relaxed  Planning  Graph 

Relaxation is a way of simplifying a problem by removing some
constraints. In planning, a relaxation is typically obtained by removing
the delete effects of actions. A solution of such a relaxed planning
problem is a relaxed plan, which can be used to estimate the cost of a
plan in the original problem; e.g., the Fast-Forward heuristic estimate
is based on the length of the relaxed plan. A classical technique for
finding a relaxed plan is to build a Relaxed Planning Graph (RPG), a
graph representing the reachability of facts and the applicability of
actions in the relaxed problem.
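
As a minimal sketch of the classical, single-agent construction, the function below grows alternating fact and action layers until the goal is reached or nothing new becomes reachable. The representation of actions as (name, preconditions, add effects) triples and the truck example are assumptions made for illustration; extraction of the relaxed plan itself is not shown.

```python
def build_rpg(init, actions, goal):
    """Build the fact and action layers of a relaxed planning graph.

    `actions` is a list of (name, preconditions, add_effects) triples;
    delete effects are left out because the problem is relaxed.
    """
    fact_layers = [frozenset(init)]
    action_layers = []
    while not goal <= fact_layers[-1]:
        facts = fact_layers[-1]
        applicable = [a for a in actions if a[1] <= facts]
        new_facts = facts.union(*(add for _, _, add in applicable))
        action_layers.append([name for name, _, _ in applicable])
        fact_layers.append(new_facts)
        if new_facts == facts:  # fixed point: nothing new is reachable
            break
    return fact_layers, action_layers


truck = [
    ("load-t1-A", frozenset({"at-t1-A", "at-p-A"}), frozenset({"in-p-t1"})),
    ("drive-t1-A-B", frozenset({"at-t1-A"}), frozenset({"at-t1-B"})),
    ("unload-t1-B", frozenset({"at-t1-B", "in-p-t1"}), frozenset({"at-p-B"})),
]
facts, layers = build_rpg({"at-t1-A", "at-p-A"}, truck, {"at-p-B"})
print(len(facts), layers)
```

With a goal the truck cannot reach on its own (for example at-p-D), the same function stops at the fixed point, where the last two fact layers are equal.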

Building distributed (non-relaxed) planning graphs was studied by
(Pellier 2010), focusing on the distribution of the Graphplan algorithm.
Relaxed MA planning graphs were recently studied by (Torreno, Onaindia,
and Sapena 2012), but in the area of planning with incomplete
information and fluent cost estimation.

To obtain a more informed global heuristic estimate in an MA planning
problem using an estimation based on an RPG, the RPG has to be
decentralized. In this work, we propose a distributed global RPG in the
form of a set of distinct Agent RPGs. Such an Agent RPG (ARPG) contains
only the facts of its owner agent. The initial state is the projection
for that agent, and since the goals are treated as public, all agents
have the complete goals in their ARPGs. The usage of actions is
straightforward for the owner agent's internal and public actions, which
are used in the same way as in a classical RPG. Additionally, the Agent
RPGs are extended by projections of other agents' public actions which
were reachable by their particular owners. This extension enables the
agents to take other agents' capabilities into account, but only at the
time points where their owners are able to reach them. Similarly to
relaxed problems in STRIPS, we define a relaxed multiagent planning
problem in MA-STRIPS as a problem stripped of the delete effects in all
actions of all agents:

Definition 1. An agent relaxed planning graph (ARPG) is a directed,
labeled and layered graph $\mathcal{R}_{\alpha} = (P' \cup A', E')$ of
one particular agent $\alpha$ for a relaxed multiagent planning task.
Let $\Pi = \langle \mathcal{L}, \mathcal{A}, s_0, S_g \rangle$ be an MA
planning task; then a relaxed MA planning task
$\Pi' = \langle \mathcal{L}, \mathcal{A}', s_0, S_g \rangle$ contains an
altered set of agents $\mathcal{A}'$ such that for every
$\alpha \in \mathcal{A}$ with $\alpha = \{a_1, \ldots, a_{|\alpha|}\}$
there exists a relaxed agent $\alpha' = \{a'_1, \ldots, a'_{|\alpha|}\}$,
all of whose actions are relaxed versions of the regular actions,
$a'_i = \langle \mathrm{pre}(a_i), \mathrm{add}(a_i), \emptyset \rangle$.
As in an RPG, the nodes of the graph represent propositions $P'$ and
actions $A'$. The arcs $E'$ represent the links between propositions and
actions.

In  the  rest  of  the  paper,  the  discussion  will  be  only 
about  relaxed  structures,  therefore  we  will  omit  the 
prime  signs,  which  are  by  convention  used  to  denote 
relaxed  structures. 

ARPGs stem from the classical RPGs, therefore the $i$-th proposition
layer and action layer will be denoted $P_i$ and $A_i$ respectively. The
layers alternate, $(P_0, A_0, P_1, A_1, \ldots, A_{n-1}, P_n)$, with all
layers $P_i \subseteq P$ and all layers $A_i \subseteq A$. The first
proposition layer $P_0$ contains nodes labeled by the propositions of
the agent's projection of the initial state, formally

$$P_0 = \{p \mid p \in s_0^{\alpha}\}.$$

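Each action layer contains action nodes for all relaxed actions of the
agent $\alpha$ applicable in the state represented by the previous fact
layer, and external projections of other agents' public actions
reachable in the same layer:

$$A_i = \{a \mid a \in \alpha, \mathrm{pre}(a) \subseteq P_i\} \cup \bigcup_{\beta \in \mathcal{A}, \beta \neq \alpha} \{b^{\alpha} \mid b \in \beta^{pub}_i\},$$

where $\beta^{pub}_i$ stands for the public actions of agent $\beta$
reachable in layer $i$.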

In all successive fact layers, the nodes copy the previous fact layer
according to the frame axiom and the facts are transformed by the
actions in the previous action layer. Since $\mathrm{del}(a) = \emptyset$
for all relaxed actions, we can write

$$P_i = P_{i-1} \cup \{p \mid p \in \mathrm{add}(a), a \in A_{i-1}\}.$$

At least one of the following terminating conditions has to hold for the
last fact layer $P_n$:

•  the last fact layer fulfills the goal, $S_g \subseteq P_n$,

•  or $P_n = P_{n-1}$, meaning there are no additional actions which can
extend further fact layers (a fixed-point).
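
The layer definitions and terminating conditions above can be sketched for a single agent as follows. The sketch rests on assumptions: external projections are given as (layer, name, add effects) triples, they are inserted from their layer onwards without a local precondition check (their owner already established reachability), and the graph keeps growing until every announced layer exists. The names and the truck t2 data are illustrative only.

```python
def build_arpg(init, own_actions, externals, goal):
    """Grow the layers of one agent's ARPG.

    own_actions: (name, preconditions, add_effects) triples of the owner.
    externals:   (layer, name, add_effects) projections of other agents'
                 public actions, usable from `layer` onwards (assumption).
    """
    fact_layers = [frozenset(init)]
    action_layers = []
    last_external = max((layer for layer, _, _ in externals), default=-1)
    i = 0
    while not goal <= fact_layers[-1]:
        facts = fact_layers[-1]
        own = [(n, add) for n, pre, add in own_actions if pre <= facts]
        ext = [(n, add) for layer, n, add in externals if layer <= i]
        new_facts = facts.union(*(add for _, add in own + ext))
        action_layers.append([n for n, _ in own + ext])
        fact_layers.append(new_facts)
        # Stop at a fixed point only once all external layers were reached.
        if new_facts == facts and i >= last_external:
            break
        i += 1
    return fact_layers, action_layers


t2_own = [
    ("drive-t2-D-C", frozenset({"at-t2-D"}), frozenset({"at-t2-C"})),
    ("drive-t2-C-D", frozenset({"at-t2-C"}), frozenset({"at-t2-D"})),
    ("load-t2-C", frozenset({"at-t2-C", "at-p-C"}), frozenset({"in-p-t2"})),
    ("unload-t2-D", frozenset({"at-t2-D", "in-p-t2"}), frozenset({"at-p-D"})),
]
t2_ext = [(3, "unload-a-C(a)", frozenset({"at-p-C"}))]
t2_facts, t2_layers = build_arpg({"at-t2-D"}, t2_own, t2_ext, {"at-p-D"})
print(len(t2_facts), t2_layers[3])
```

In this example the truck's own actions saturate after the first layer, but the graph is still extended so that the projected airplane action can be placed at its layer 3, after which the goal at-p-D becomes reachable.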

The arcs in an ARPG represent the applicability and application of
actions in relaxed states. We can split the arcs between two fact layers
$P_i$ and $P_{i+1}$ into three groups. The first one contains the arcs
between the facts of layer $P_i$ and the preconditions of actions in
layer $A_i$. The second one contains the relation between the effects of
actions and the next induced fact layer $P_{i+1}$. Additionally, there
are arcs for all facts from the previous layer, effectively representing
the frame axioms of the closed world assumption. Formally,

$$E_i^{pre} = \{(p_i, a_i) \mid p_i \in \mathrm{pre}(a_i), a_i \in A_i\},$$
$$E_i^{add} = \{(a_i, p_{i+1}) \mid a_i \in A_i, p_{i+1} \in \mathrm{add}(a_i)\},$$
$$E_i^{frm} = \{(p_i, p_{i+1}) \mid p_i \in P_i, p_{i+1} \in P_{i+1}, p_i = p_{i+1}\}$$

and $E_i = E_i^{pre} \cup E_i^{add} \cup E_i^{frm}$. Now we will provide
an algorithm for distributed building of ARPGs.
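
The three arc groups between two neighbouring fact layers can be computed directly from the layers. The following is a minimal sketch in which nodes are represented as (label, layer index) pairs; this encoding and the small truck example are assumptions made for illustration.

```python
def layer_arcs(i, facts_i, actions_i, facts_next):
    """Return the precondition, add and frame arcs between layers i and i+1.

    actions_i is a list of (name, preconditions, add_effects) triples that
    are present in action layer i.
    """
    pre_arcs = {((p, i), (name, i))
                for name, pre, _ in actions_i for p in pre}
    add_arcs = {((name, i), (p, i + 1))
                for name, _, add in actions_i for p in add}
    frame_arcs = {((p, i), (p, i + 1)) for p in facts_i if p in facts_next}
    return pre_arcs, add_arcs, frame_arcs


p0 = {"at-t1-A", "at-p-A"}
a0 = [
    ("load-t1-A", {"at-t1-A", "at-p-A"}, {"in-p-t1"}),
    ("drive-t1-A-B", {"at-t1-A"}, {"at-t1-B"}),
]
p1 = p0 | {"in-p-t1", "at-t1-B"}
pre_arcs, add_arcs, frame_arcs = layer_arcs(0, p0, a0, p1)
print(len(pre_arcs), len(add_arcs), len(frame_arcs))
```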

Algorithm  The algorithm starts with each agent building an ARPG using
only its own internal and public actions. An iterative process is then
initiated, in which the agents exchange information about their public
actions and extend their ARPGs with projected public actions of other
agents. The algorithm terminates when the goal (or a fixed-point) is
globally reached and there are no more messages to process.

Algorithm 1 Distributed build of Agent Relaxed Planning Graphs

Input: An agent's factor of the relaxed MA planning problem
$\Pi^{\alpha} = \langle \mathcal{L}^{\alpha}, \alpha, s_0^{\alpha}, S_g \rangle$.
Output: Agent Relaxed Planning Graph $\mathcal{R}$ for $\alpha$.

 1: init():
 2:   $\mathcal{R} \leftarrow P_0 = \{p \mid p \in s_0^{\alpha}\}$
 3:   $\mathcal{R} \leftarrow$ build-RPG($\mathcal{R}$, $\alpha$, $S_g$)
 4:   $S \leftarrow$ map[$\alpha^{pub}$, integer]
 5:   ack-count $\leftarrow$ idle-count $\leftarrow$ 0
 6:   check()

 7: check():
 8:   for all $a \in A_{n-1}$ s.t. $a$ is public do
 9:     if $a \notin S$ or $S[a] >$ earliest layer of appearance of $a$ in $\mathcal{R}$ then
10:       $S[a] \leftarrow$ earliest layer of appearance of $a$ in $\mathcal{R}$
11:       ack-count $\leftarrow$ ack-count + $|\mathcal{A} \setminus \alpha|$
12:       $\forall \beta \in \mathcal{A} \setminus \alpha$ : send(ext-a[$a^{\beta}$, $S[a]$], to $\beta$)
13:     end if
14:   end for

15: receive(ext-a[$a^{\alpha}$, $i \in \mathbb{N}$], from $\beta \in \mathcal{A} \setminus \alpha$):
16:   $\forall \gamma \in \mathcal{A}$ : send(not-idle, to $\gamma$)
17:   send(ack, to $\beta$)
18:   $\mathcal{R} \leftarrow$ extend-RPG($\mathcal{R}$, [$a^{\alpha}$, $i$])
19:   $\mathcal{R} \leftarrow$ build-RPG($\mathcal{R}$, $\alpha$, $S_g$)
20:   check()
21:   goal-reached()

22: receive(ack):
23:   ack-count $\leftarrow$ ack-count - 1
24:   goal-reached()

25: receive(idle):
26:   idle-count $\leftarrow$ idle-count + 1
27:   if idle-count = $|\mathcal{A}|$ then
28:     return $\mathcal{R}$
29:   end if

30: receive(not-idle):
31:   idle-count $\leftarrow$ idle-count - 1

32: goal-reached():
33:   if $S_g \in \mathcal{R}$ and ack-count = 0 and message queue is empty then
34:     $\forall \gamma \in \mathcal{A}$ : send(idle, to $\gamma$)
35:   end if

The pseudo-code of the algorithm is given in Algorithm 1 in an
event-driven fashion. The events can be caused either by receiving a
message or internally by the algorithm itself. We assume that messages
sent from one agent arrive in the same order as they were sent, but we
assume no ordering between messages sent from different agents. We will
now explain each event handling routine.

Figure 2: Distributed building of Agent Relaxed Planning Graphs
decomposed into iterations (panels 1 to 3, each showing the ARPGs of
agents a, t1 and t2).

In the init phase, a Relaxed Planning Graph $\mathcal{R}$ is built by
the build-RPG method from the initial state projection $s_0^{\alpha}$
using only the agent's own actions. A map $S$, used to store the
earliest layer of appearance¹ of the agent's public actions, is
initialized along with other supporting data structures used for
synchronized termination of the algorithm. After the initialization
phase, reaching of the goal (or a fixed-point) is checked, and if it has
been reached, all agents are informed that the agent is now idle. Next,
the check procedure is executed; it is responsible for checking whether
$\mathcal{R}$ contains any public actions. If so, each action is sent to
all other agents $\beta \in \mathcal{A} \setminus \alpha$ as a
projection $a^{\beta}$ together with its earliest layer of appearance,
unless it has already been broadcast with an equal or lower layer number
(this can happen in future check calls).


1  Earliest  layer  of  appearance  of  an  action  a  in  (A)RPG 
is  the  first  action  layer,  where  a  is  applicable. 


In the next part of the algorithm, there are four message handling
receive procedures. The first one, ext-a, is executed when a projection
of another agent's public action is received. After the control messages
are sent, the action is integrated into $\mathcal{R}$ on the $i$-th
layer by the extend-RPG method and the change is propagated by
build-RPG, so that all actions newly applicable in the following layers
are applied accordingly. Then the built ARPG is checked for whether new
public actions (and public actions newly applicable on earlier layers)
are reachable and whether the goal or the fixed-point was reached. The
last three receive procedures maintain the control information needed
for distributed termination detection (Mattern 1987). The ack counter
keeps track of the number of sent external actions and postpones
termination until all sent actions are processed. If an idle message is
received, there are no pending acks and the number of idle agents is
equal to $|\mathcal{A}|$, the algorithm terminates and the resulting
ARPG $\mathcal{R}$ is returned. Since the not-idle and ack messages are
sent in this particular order (lines 16 and 17) and the messages from
one agent are presumed to keep their ordering, the algorithm terminates
synchronously when all external actions are processed and no messages
are pending.
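
The bookkeeping behind the last three receive procedures can be sketched as a small class. This is an illustration under stated assumptions, not the planner's implementation: the class and method names are hypothetical, message transport is left out entirely, and one acknowledgement is expected from every other agent for each broadcast action.

```python
class TerminationCounters:
    """Counters one agent keeps for distributed termination detection."""

    def __init__(self, n_agents):
        self.n_agents = n_agents
        self.ack_count = 0
        self.idle_count = 0

    def broadcast_action(self):
        # Assumption: every other agent acknowledges the broadcast action.
        self.ack_count += self.n_agents - 1

    def received_ack(self):
        self.ack_count -= 1

    def received_idle(self):
        self.idle_count += 1
        # True means all agents are idle and the ARPG can be returned.
        return self.idle_count == self.n_agents

    def received_not_idle(self):
        self.idle_count -= 1

    def may_announce_idle(self, goal_or_fixed_point, queue_empty):
        return goal_or_fixed_point and self.ack_count == 0 and queue_empty


counters = TerminationCounters(n_agents=3)
counters.broadcast_action()
blocked = counters.may_announce_idle(True, True)    # acks still pending
counters.received_ack()
counters.received_ack()
allowed = counters.may_announce_idle(True, True)    # all acks arrived
print(blocked, allowed)
```

The agent may announce that it is idle only after both acknowledgements have arrived, which is what postpones termination while sent actions are still being processed elsewhere.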

In Figure 2, Algorithm 1 is applied to the running example depicted in
Figure 1. Although the algorithm runs asynchronously, we can decompose
it for clarity into several iterations. In the first iteration, the
ARPGs are built using only the actions of the respective agents a, t1
and t2 (the airplane and the two trucks). Notice that the bold green
action unload-t1-B, which is a public action of truck t1, can be applied
thanks to the initial position of the package. In the next iteration,
the projection of this public action is broadcast and received by the
other agents. Upon receiving it, they update their ARPGs, which for the
airplane means that the ARPG is expanded with further layers. Another
public action, unload-a-C, is applied and therefore broadcast. In the
third iteration, the projection of the airplane's unload action is added
to the ARPGs of the trucks. For truck t1 it has no effect, but it allows
truck t2 to expand its ARPG and reach the goal at-p-D. Notice that when
the projected unload-a-C(a) was received by truck t2, its ARPG was first
extended to have enough layers for the action to be added to the correct
layer.

Although  not  shown  in  Figure  2,  the  algorithm  would 
continue  with  one  more  iteration  after  broadcasting  the 
public  action  reached  by  truck  t2,  resulting  in  all  agents 
having  ARPGs  with  the  same  number  of  layers  and  all 
having  reached  the  goal.  Additionally,  the  algorithm 
does  not  have  to  terminate  when  the  goal  is  reached, 
but  can  continue  until  the  fixed-point,  which  can  be 
desirable  in  some  situations  and  which  is  also  the  case 
when  the  goal  is  not  reachable. 
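
The layer-by-layer growth up to the fixed point can be illustrated with a small centralized sketch in Python. It is not the distributed Algorithm 1. The encoding of the logistics example (fact names, action names, preconditions and add effects) is an assumption made for illustration, loosely following the labels of the running example.

```python
def build_rpg(init, actions):
    """Build a delete-relaxed planning graph up to its fixed point.

    `actions` maps a name to a (preconditions, add_effects) pair of sets.
    Returns (fact_layers, action_layers); there is one more fact layer
    than action layers.
    """
    fact_layers = [frozenset(init)]
    action_layers = []
    while True:
        current = fact_layers[-1]
        layer = frozenset(n for n, (pre, _) in actions.items() if pre <= current)
        nxt = current | frozenset(p for n in layer for p in actions[n][1])
        # Fixed point: neither new facts nor newly applicable actions.
        if nxt == current and action_layers and layer == action_layers[-1]:
            return fact_layers, action_layers
        action_layers.append(layer)
        fact_layers.append(nxt)


# Assumed toy encoding: truck t1 moves the package A -> B, the airplane
# B -> C, and truck t2 C -> D. Delete effects are ignored (relaxation).
ACTIONS = {
    "load-t1-A": ({"at-p-A", "at-t1-A"}, {"in-p-t1"}),
    "drive-t1-A-B": ({"at-t1-A"}, {"at-t1-B"}),
    "unload-t1-B": ({"in-p-t1", "at-t1-B"}, {"at-p-B"}),
    "load-a-B": ({"at-p-B", "at-a-B"}, {"in-p-a"}),
    "fly-a-B-C": ({"at-a-B"}, {"at-a-C"}),
    "unload-a-C": ({"in-p-a", "at-a-C"}, {"at-p-C"}),
    "drive-t2-D-C": ({"at-t2-D"}, {"at-t2-C"}),
    "load-t2-C": ({"at-p-C", "at-t2-C"}, {"in-p-t2"}),
    "unload-t2-D": ({"in-p-t2", "at-t2-D"}, {"at-p-D"}),
}
INIT = {"at-p-A", "at-t1-A", "at-a-B", "at-t2-D"}

facts, acts = build_rpg(INIT, ACTIONS)
```

Under this encoding the goal fact at-p-D first appears in the last fact layer, and construction stops one step later, when an additional layer would add neither facts nor actions.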

Proof sketch. In this section, we will sketch a proof showing that the
Agent Relaxed Planning Graphs built by Algorithm 1 are compatible with a
global RPG, meaning that they contain the same actions (with respect to
projections) and that the actions are in the same layers. We will use this
theorem later in the proof of equality of the centralized FF and
multiagent FF (MAFF) heuristics.

First, we will formally define the concept of compatibility; then we will
show that a single iteration of the algorithm does not violate
compatibility; and finally we will show that the algorithm terminates and
that the resulting ARPGs are compatible with a centrally built RPG, with
no actions missing or superfluous.

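Let $\Pi = \langle \mathcal{L}, \mathcal{A}, s_0, S_g \rangle$ be a
relaxed MA planning task, $\mathcal{R}^\alpha$ be an Agent Relaxed
Planning Graph for agent $\alpha$ built from $\Pi$ using Algorithm 1,
having alternating layers
$(P_0^\alpha, A_0^\alpha, P_1^\alpha, A_1^\alpha, \ldots, A_{n-1}^\alpha, P_n^\alpha)$,
and let $A^\alpha = \bigcup_{i \in \langle 0, n-1 \rangle} A_i^\alpha$ and
$A^{\mathcal{A}} = \bigcup_{\alpha \in \mathcal{A}} A^\alpha$. Let
$\Pi' = \langle \mathcal{L}, Act, s_0, S_g \rangle$ be a classical relaxed
planning task, $\mathcal{R}$ be a classical Relaxed Planning Graph built
from $\Pi'$ having alternating layers
$(P_0, A_0, P_1, A_1, \ldots, A_{n-1}, P_n)$, and let
$A = \bigcup_{i \in \langle 0, n-1 \rangle} A_i$. Note that from the
monotonicity of (A)RPGs it follows that
$\forall i < j : P_i \subseteq P_j$ and
$\forall i < j : A_i \subseteq A_j$.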

Definition 2. Let $a \triangleright A_j$ denote that an action $a$ is
first applicable in layer $A_j$ (formally
$pre(a) \subseteq P_j \wedge pre(a) \nsubseteq P_{j-1}$) regardless of
whether the underlying structure is an RPG or an ARPG.
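
As an illustration of Definition 2, the following minimal Python sketch finds the layer in which an action is first applicable. The function name and the representation of fact layers as a list of sets are assumptions of this sketch. Because fact layers grow monotonically, the first layer containing all preconditions automatically satisfies the second condition of the definition.

```python
def first_applicable_layer(pre, fact_layers):
    """Return j with pre ⊆ P_j and pre ⊄ P_{j-1}, or None if no such j exists.

    `fact_layers` is the list [P_0, P_1, ...] of a (A)RPG and is assumed
    to be monotone, i.e. P_i ⊆ P_j for i < j.
    """
    for j, facts in enumerate(fact_layers):
        if pre <= facts:
            return j
    return None


layers = [{"x"}, {"x", "y"}, {"x", "y", "z"}]
```

For these example layers an action with preconditions {x, y} is first applicable in layer 1, and an action requiring a fact that never appears is not applicable at all.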

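Definition 3. We define that a set of ARPGs
$R = \{\mathcal{R}^\alpha | \alpha \in \mathcal{A}\}$ is compatible with
an RPG $\mathcal{R}$ iff for each action $a \in A^{\mathcal{A}}$ for which
$a \triangleright A_j$ the following holds:

1) If $a$ is an internal action of agent $\alpha$, then
$a \triangleright A_j^\alpha$.

2) If $a$ is a public action of agent $\alpha$, then
$a \triangleright A_j^\alpha$ and
$\forall \beta \in \mathcal{A} \setminus \alpha : a^\beta \triangleright A_j^\beta$,
where $a^\beta$ is the projection of action $a$ for agent $\beta$.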
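
The two conditions of Definition 3 can be checked mechanically once the first-applicable layers are known. The Python sketch below assumes precomputed tables with hypothetical names; projections are stored under the name of their original action, which is a simplification of this sketch.

```python
def is_compatible(global_first, agent_first, owner, public):
    """Check Definition 3 on 'first applicable layer' tables.

    global_first: action -> j such that the action is first applicable in A_j
    agent_first:  agent -> {action or projection -> first layer in its ARPG}
    owner:        action -> owning agent
    public:       set of public actions
    """
    for action, alpha in owner.items():
        if action not in agent_first[alpha]:
            continue  # not applied by its owner yet, so not in A^A
        j = global_first.get(action)
        # Conditions 1) and 2): the owner sees the action in the same layer.
        if agent_first[alpha][action] != j:
            return False
        # Condition 2): every other agent sees the projection in that layer.
        if action in public:
            for beta, layers in agent_first.items():
                if beta != alpha and layers.get(action) != j:
                    return False
    return True


owner = {"load-t1-A": "t1", "unload-t1-B": "t1"}
public = {"unload-t1-B"}
global_first = {"load-t1-A": 0, "unload-t1-B": 1}
good = {"t1": {"load-t1-A": 0, "unload-t1-B": 1}, "a": {"unload-t1-B": 1}}
bad = {"t1": {"load-t1-A": 0, "unload-t1-B": 1}, "a": {"unload-t1-B": 0}}
```

Here `good` is compatible with the global table, while `bad` places the projection of unload-t1-B one layer too early in the airplane's graph.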

Lemma 4. A set of ARPGs
$R = \{\mathcal{R}^\alpha | \alpha \in \mathcal{A}\}$ compatible with an
RPG $\mathcal{R}$ stays compatible with $\mathcal{R}$ after an application
of build-RPG by agent $\alpha$ and successive extend-RPG by all other
agents.

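Proof. Let us have an RPG $\mathcal{R}$ built from a classical relaxed
planning task $\Pi'$ and a set of ARPGs
$R = \{\mathcal{R}^\alpha | \alpha \in \mathcal{A}\}$ being built from the
relaxed MA planning task $\Pi$ using Algorithm 1. The symbol
$A^{\mathcal{A}}$ denotes all actions applied in the algorithm so far and
let $\alpha^{proj}$ be the set of projected actions received by agent
$\alpha$ so far. Now, agent $\alpha$ applies build-RPG, so that
$\mathcal{R}^\alpha$ is updated by
$A_i^\alpha = A_i^\alpha \cup \{a | a \in \alpha \cup \alpha^{proj} : pre(a) \subseteq P_i^\alpha\}$
(and accordingly $P_{i+1}^\alpha$), for each layer $A_i^\alpha$. Let us
assume there exists an extra action $a \in \alpha$ which was newly applied
($a \notin A^{\mathcal{A}}$) and for which
$a \triangleright A_i \wedge a \ntriangleright A_i^\alpha$. We can also
assume WLOG that $a$ is the first such action (in terms of layer of
appearance).

From the definition of $a \triangleright A_i$, where
$pre(a) \subseteq P_i \wedge pre(a) \nsubseteq P_{i-1}$, it follows that
$pre(a) \subseteq P_0 \cup \{p | b \in A_{i-1}, p \in add(b)\}$ and
$pre(a) \nsubseteq P_0 \cup \{p | b \in A_{i-2}, p \in add(b)\}$. Because
$a$ is the first action for which
$a \triangleright A_i \wedge a \ntriangleright A_i^\alpha$, for all
actions $b \in \alpha$ for which $b \triangleright A_k$ holds, where
$k < i$, also $b \triangleright A_k^\alpha$ holds. Therefore
$A_k^\alpha = A_k \cap \alpha$ and $P_k^\alpha = P_k \cap atoms(\alpha)$
for all $k < i$ and therefore
$pre(a) \subseteq P_i^\alpha \wedge pre(a) \nsubseteq P_{i-1}^\alpha$,
which means $a \triangleright A_i^\alpha$, and that is a contradiction.
Now, we can assign $A^{\mathcal{A}} \leftarrow A^{\mathcal{A}} \cup \{a\}$
and repeat the former step.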

After broadcasting projections of the newly applied public actions and
calling extend-RPG by all other agents, we can show that the second part
of Definition 3 also holds. Let $a^\alpha$ be the projection of an action
$a \in \beta$ which is broadcast first. If there exists some $i$ for which
$a \triangleright A_i$ then $pre(a) \subseteq P_i$ and for the projection
$a^\alpha$ holds $pre(a^\alpha) \subseteq P_i \cap \mathcal{L}^{pub}$.
Because for all actions $b \in A_{i-1}$ the lemma holds, if $b$ is public,
$b^\alpha \triangleright A_{i-1}^\alpha$. Because $a^\alpha$ is a
projection,
$pre(a^\alpha) \subseteq P_0^\alpha \cup \bigcup_{b \in A_{i-1}^\alpha} add(b)$
and therefore $a^\alpha$ is applicable in layer $i$ (and subsequent
layers), which means that extend-RPG ensures that
$a^\alpha \triangleright A_i^\alpha$. □


Theorem 5. When Algorithm 1 terminates, the resulting set of ARPGs
$R = \{\mathcal{R}^\alpha | \alpha \in \mathcal{A}\}$ built from $\Pi$ is
compatible with the RPG $\mathcal{R}$ built from $\Pi$ and there are no
additional actions, i.e., $A = A^{\mathcal{A}}$. Each public action in $A$
has its projected counterparts in $A^{\mathcal{A}}$ and vice versa, i.e.,
for each public action $a \in A$ such that $a \in \alpha$ for some agent
$\alpha$ there exists a projected action $a^\beta$ for each agent
$\beta \in \mathcal{A} \setminus \alpha$ and
$a^\beta \in A^{\mathcal{A}}$. There is no projected action
$a^\beta \in A^{\mathcal{A}}$ for which there is no original action
$a \in A$.

Proof. We will now sketch an induction which shows the compatibility in
Theorem 5, based on Lemma 4. For the initial step of the induction we take
all ARPGs containing only the first fact layer $P_0^\alpha$, which is
trivially compatible because $A^{\mathcal{A}} = \emptyset$. The induction
step is covered by Lemma 4, because each step of Algorithm 1 can be
decomposed as an application of build-RPG and, if there are any applied
public actions, broadcasting their projections and application of
extend-RPG by all other agents. Even though the algorithm runs
asynchronously, Lemma 4 holds because of the monotonicity of (A)RPGs.

Termination of the algorithm follows from the termination of the classical
RPG construction: the algorithm reaches either the goal or a fixed point,
where no more actions are added. Similarly to the classical RPG, the
building of ARPGs is monotonic, which means that facts and actions can
only be added, and because the set of actions is finite, there must be a
point where no more actions can be added and the algorithm terminates. The
detection of such a situation is more complicated in the distributed
setting and is described thoroughly in the algorithm section.

The last statement we are about to show is that there are no additional
actions, i.e., $A = A^{\mathcal{A}}$ (irrespective of the projections of
public actions) and that each public action in $A$ has its projection in
$A^{\mathcal{A}}$ and vice versa. Let us assume that $\exists a \in A$
such that $a \notin A^{\mathcal{A}}$; let us also assume, WLOG, that $a$
is such an action appearing in the earliest layer in $\mathcal{R}$, say
$A_i$, and that $a \in \alpha$ for some agent $\alpha$. We know that there
exists a minimal $A_{pre} \subseteq A_{i-1}$ such that
$pre(a) \subseteq \{p | b \in A_{pre}, p \in add(b)\}$. Because
$pre(a) \subseteq \mathcal{L}^{\alpha\text{-}int} \cup \mathcal{L}^{pub}$,
for all actions $b \in A_{pre}$ either $b \in \alpha$ or $b$ is public.
Because we assumed that $A_{pre} \subseteq A_{i-1}$, for each
$b \in A_{pre}$, if $b \in \alpha$ then $b \in A_{i-1}^\alpha$, and if
$b \notin \alpha$ then $b$ is public and therefore
$b^\alpha \in A_{i-1}^\alpha$. From the above it follows that
$pre(a) \subseteq \{p | b \in A_{i-1}^\alpha, p \in add(b)\}$, which means
that $a$ is applicable in $A_i^\alpha$ and therefore $a$ must be applied
by the algorithm. If we continue with the next such action we end up with
$A \subseteq A^{\mathcal{A}}$.

Now, we will show that $A \supseteq A^{\mathcal{A}}$. Let us assume that
$\exists a \in A^{\mathcal{A}}$ such that $a \notin A$ and that $a$ is the
first such action. Similarly to the previous situation, if $a$ is
applicable in some layer $A_i^\alpha$ then there is some set of actions
$A_{pre} \subseteq A_{i-1}^\alpha$ which contains all actions providing
the preconditions of $a$. Since all actions $b \in A_{pre}$ are also in
$A$, $a$ is applicable in $A_i$ and therefore must be applied.


From $A \subseteq A^{\mathcal{A}}$ and $A \supseteq A^{\mathcal{A}}$ it
follows that $A = A^{\mathcal{A}}$. We have already shown that public
actions have their projected counterparts. The only remaining part to show
is that there are no projected actions in $A^{\mathcal{A}}$ without their
respective original actions in $A$. This clearly follows from the
algorithm itself, because all projected actions are created only when a
public action is added to some $A_i^\alpha$ by agent $\alpha$ and, as
shown before, such an action would also be added to $A_i$. □

Multiagent  FF  Heuristic 
With the help of ARPGs, the Fast-Forward heuristic estimate can be
straightforwardly adapted to a multiagent setting. We will denote such a
heuristic as $h^{MAFF}$. The multiagent (MA) relaxed plan backing the
$h^{MAFF}$ estimate can in general be spread over all ARPGs of the agents
in the team, as illustrated in Figure 3. The left-most achieving actions
have to be considered across all agents. In the case of projected public
actions, the owner agent has to define part of the relaxed plan, possibly
using its internal actions, to achieve the internal facts of the provided
public action. Additionally, the relaxed plan has to share public actions
which are required by more agents at the same layers. The private parts of
the relaxed plan provided by the other agents can be described by
place-holding actions and therefore no private information of the other
agents has to be revealed. The final heuristic estimate is the count of
actions of the MA relaxed plan.

Definition. Let a MA relaxed plan $\pi$ be a solution of a MA relaxed
problem $\Pi = \langle \mathcal{L}, \mathcal{A}, s, S_g \rangle$, where
$s$ is the state we are estimating the cost for; then $|\pi| = h(s)$ is
the multiagent relaxation heuristic estimate.

Similarly to the relaxation heuristic estimate $h^{FF}$, we restrain $\pi$
for $h^{MAFF}$ according to $h^{FF}$. A particular $\pi$ is defined using
ARPGs $\mathcal{R}^\alpha = (P \cup A, E)$ of all agents
$\alpha \in \mathcal{A}$ built for the state $s$. From the right (meaning
as in Figure 3), the relaxed plan $\pi$ contains a minimal set of actions
$A_m^* \subseteq A_m$ achieving the goal facts. The action layer $A_m$
contains actions of all agents in layer $m$ (ignoring projections of
actions, since the respective original actions are also included in the
same layer). If there is a frame arc $(p_{m-1}, p_m) \in E$ of such facts,
i.e., $p_m \in S_g$, the fact $p_m$ does not need an explicit achieving
action from this particular layer as it will be achieved by an action from
an earlier (more left) layer. This principle effectively selects the
left-most achievers of a fact as proposed by the FF heuristic. The action
set $A_m^*$ induces the next set of facts across all agents

$S_m = \{p | p \in pre(a) : a \in A_m^*\}$,

which has to be achieved by actions from the previous action layer
$A_{m-1}$, and so on until the action layer $A_0$ is reached, where all
the actions have their preconditions satisfied by the initial state in
$P_0$.

Notice that since this definition works across all ARPGs of all agents,
the resulting $\pi$ may contain actions of different agents.
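
The backward selection of left-most achievers described above can be sketched for a single layered graph as follows. This is a minimal Python illustration under stated assumptions: the graph is given as plain lists of sets, the goal is reachable in the last fact layer, ties between achievers are broken by name, and projections and inter-agent requests are not modeled.

```python
def extract_relaxed_plan(fact_layers, action_layers, actions, goal):
    """Select first achievers from the last action layer back to A_0.

    `actions` maps a name to a (preconditions, add_effects) pair of sets.
    Returns the set of selected actions; its size is the estimate.
    """
    if not set(goal) <= fact_layers[-1]:
        raise ValueError("goal is not reachable in the given graph")
    needed = set(goal)
    plan = set()
    for i in range(len(action_layers) - 1, -1, -1):
        new_needed = set()
        for p in sorted(needed):
            if p in fact_layers[i]:
                continue  # frame arc: p is achieved in an earlier layer
            achiever = min(n for n in action_layers[i] if p in actions[n][1])
            plan.add(achiever)
            new_needed |= actions[achiever][0]
            needed.discard(p)
        needed |= new_needed
    return plan


actions = {
    "a1": ({"x"}, {"y"}),
    "a2": ({"y"}, {"g"}),
    "a3": ({"x"}, {"z"}),  # applicable but irrelevant for the goal
}
fact_layers = [{"x"}, {"x", "y", "z"}, {"x", "y", "z", "g"}]
action_layers = [{"a1", "a3"}, {"a1", "a2", "a3"}]
plan = extract_relaxed_plan(fact_layers, action_layers, actions, {"g"})
```

In this toy graph the extraction picks a2 for the goal and a1 for its precondition, and it ignores a3, so the estimate is 2.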


[Figure 3: Multiagent Relaxed Plan. Figure not reproduced; node labels by
agent row:
a: at-a-B, fly-a-B-C, at-a-C, unload-t1-B(t1), at-p-B, load-a-B, in-p-a,
unload-a-C, at-a-C;
t1: at-p-A, at-t1-A, load-t1-A, drive-t1-A-B, at-t1-B, in-p-t1,
unload-t1-B, at-p-B;
t2: at-t2-D, drive-t2-D-C, at-t2-C, unload-a-C(a), at-p-C, load-t2-C,
in-p-t2, unload-t2-D, at-p-D.]


We can compute $h^{MAFF}(s)$ by first building the set of ARPGs
$R = \{\mathcal{R}^\alpha | \alpha \in \mathcal{A}\}$ for the relaxed MA
planning problem
$\Pi = \langle \mathcal{L}, \mathcal{A}, s, S_g \rangle$ using
Algorithm 1, then simultaneously extracting relaxed plans $\pi_\alpha$ for
each agent using Algorithm 2, and finally summing the lengths of the
resulting relaxed plans, excluding projections of other agents' public
actions.

Theorem 6. Let $\pi_\alpha \cap \alpha$ be the computed relaxed plan of
agent $\alpha$ restricted only to the agent's actions (excluding all
projections of other agents' public actions), then
$h^{MAFF}(s) = \sum_{\alpha \in \mathcal{A}} |\pi_\alpha \cap \alpha| = h^{FF}(s)$.

Proof. The fact that $h^{MAFF}(s) = h^{FF}(s)$ follows from the previously
shown compatibility of the RPG $\mathcal{R}$ built for the classical
relaxed planning problem $\Pi'$ and the set of ARPGs
$R = \{\mathcal{R}^\alpha | \alpha \in \mathcal{A}\}$ built for the
relaxed MA planning problem $\Pi$. We first extract relaxed plans $\pi$
for $\Pi'$ and $\{\pi_\alpha | \alpha \in \mathcal{A}\}$ for $\Pi$ by
effectively choosing first achievers of goal facts and of preconditions of
previously chosen achievers. It is clear that for some fact $p$ we choose
an action $a$ s.t. $p \in add(a)$ only if $a \triangleright A_i$ for some
layer $A_i$ and there is no action $b$ s.t. $p \in add(b)$ and
$b \triangleright A_j$ for some $j < i$. Because of the compatibility of
the RPG and ARPGs, we choose exactly the same actions (and their
projections, which are then omitted) for $\pi$ and for
$\{\pi_\alpha | \alpha \in \mathcal{A}\}$, which means that
$|\pi| = \sum_{\alpha \in \mathcal{A}} |\pi_\alpha \cap \alpha|$ and
therefore $h^{MAFF}(s) = h^{FF}(s)$. □
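
As a worked instance of the sum in Theorem 6, consider the relaxed plan of Figure 3 for the initial state $s_0$ of the running example. The count below assumes that the figure lists the complete relaxed plan: the airplane contributes fly-a-B-C, load-a-B and unload-a-C; truck t1 contributes load-t1-A, drive-t1-A-B and unload-t1-B; truck t2 contributes drive-t2-D-C, load-t2-C and unload-t2-D. The projections unload-t1-B(t1) and unload-a-C(a) are not counted.

```latex
\[
h^{MAFF}(s_0)
  = |\pi_{a} \cap a| + |\pi_{t1} \cap t1| + |\pi_{t2} \cap t2|
  = 3 + 3 + 3
  = 9
\]
```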

Experiments 

The experiments were conducted on an implementation of a satisficing
version of MA-A* (Nissim and Brafman 2012) with various relaxation
heuristic estimates². The algorithm begins with a centralized
factorization and a reachability analysis of a centralized planning
problem. After the factorization, the agents are started, each receiving
its factor of the problem as an input. The agents run in parallel, each
one in its own thread, and the messages are delivered by an additional
asynchronous messaging thread. Each agent uses an event queue to serialize
the computation and reactions to incoming messages. If an agent finds a
sound plan (with parts from other agents), it prints it and stops the
distributed process.
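
The event-queue pattern described above can be sketched as follows. The evaluated implementation runs on the Java Virtual Machine; this Python sketch uses hypothetical class and event names and only shows how a per-agent queue serializes search steps and incoming messages in one thread.

```python
import queue
import threading


class AgentWorker(threading.Thread):
    """One agent: handles its events strictly one at a time (hypothetical)."""

    def __init__(self, name):
        super().__init__(name=name, daemon=True)
        self.events = queue.Queue()  # thread-safe FIFO shared with senders
        self.log = []                # events in the order they were handled

    def post(self, event):
        # Called from other threads, e.g. by a messaging thread.
        self.events.put(event)

    def run(self):
        while True:
            event = self.events.get()  # blocks until an event is available
            if event == "stop":
                break
            self.log.append(event)     # placeholder for the real handler


worker = AgentWorker("t1")
worker.start()
for event in ("expand-state", "message-from-a", "expand-state", "stop"):
    worker.post(event)
worker.join()
```

Because all work passes through one queue, the agent never processes a message while it is in the middle of a search step, which avoids explicit locking of its search state.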

Since the algorithm is asynchronous, the runs are non-deterministic.
Therefore we conducted each experimental run as ten measurements. Each
measurement was limited to 8GB of memory for the Java Virtual Machine and
to 10 minutes of runtime. Each measurement
² Technically, the implementation is not A* as the heuristics are not
admissible; the algorithm used is therefore, more precisely,
MA-BestFirstSearch.


Algorithm 2 Distributed extraction of $h^{MAFF}$
Input: ARPG $\mathcal{R}$ for state $s$, having layers
$(P_0, A_0, P_1, A_1, \ldots, A_{n-1}, P_n)$ and goal
$S_G \subseteq \mathcal{L}$.
Output: Relaxed plan $R$ from $s$ to $S_G$.

  $P \leftarrow S_G$
  $R \leftarrow \emptyset$
  for $i = n - 1$; $i \geq 0$; $i \leftarrow i - 1$ do
    $P' \leftarrow \emptyset$
    for $p \in P$ do
      if $p \notin P_i$ then
        $a \leftarrow a \in A_i$, such that $p \in add(a)$
        $R \leftarrow R \cup \{a\}$
        $P' \leftarrow P' \cup pre(a)$, $P \leftarrow P \setminus \{p\}$
        if $a$ is projected action of agent $\alpha$ then
          request $R_\alpha$ for $s$, goal $pre(a^{orig})$ from $\alpha$
          $R \leftarrow R \cup R_\alpha$
        end if
      end if
    end for
    $P \leftarrow P \cup P'$
  end for
  return $R$


was run separately on an 8-core processor at 3.6GHz. The results from the
measurements were averaged.

We used five planning domains, four originating in the single-agent IPC
planning benchmarks. Similarly to the evaluation of the algorithms in
(Nissim, Brafman, and Domshlak 2010), we chose domains which are
straightforwardly modifiable to the multiagent setting: LOGISTICS (similar
to the running example, but with more agents), LINEAR LOGISTICS (one
package has to be transported stepwise by all agents), ROVERS, and
SATELLITES. As in (Komenda, Novak, and Pechoucek 2013), we have extended
the set of IPC-based domains by a coordination domain COOPERATIVE
PATHFINDING, in which robots on a grid are tasked to switch their
positions without colliding with each other. We tested the following
distribution strategies of FF heuristic estimation:

• $h^{FF}$ using only locally built RPGs (including projections of public
actions) and local estimation of the Fast-Forward heuristic, as proposed
in (Nissim and Brafman 2012).

• $h^{MAFF}$ using distributed ARPGs based on Algorithm 1 and distributed
extraction of the FF heuristic as described in Algorithm 2.

• Lazy $h^{MAFF}$ using only locally built RPGs (including projections of
public actions) and distributed extraction of the FF heuristic as in
Algorithm 2 with additional information on reachability of projected
actions.

Table 1: Experimental results for the heuristics. $|\mathcal{A}|$ is the
number of agents, $l$ is the sequential length of the plan ($l^*$ is
optimal), $t$ is the duration of the search in seconds, $v$ is the number
of visited states, $c_s$ is the number of search messages (each of the
size of a state), $c_r$ is the number of messages building ARPGs (each of
the size of a projected public action) and $c_h$ is the number of messages
for the heuristic estimate (each of the size of a partial relaxed plan).
As $h^{FF}$ does not build distributed ARPGs, $c_r$ and $c_h$ are always
zero. The domains are COOPERATIVE PATHFINDING (CP), LOGISTICS (LOG),
LINEAR LOGISTICS (LLG), ROVERS (ROV) and SATELLITES (SAT). Runs denoted
as - did not finish within the limits.

The implementation we used is a preliminary prototype which is not
competitive with the current single-agent state-of-the-art planners, but
it nevertheless gives insight into the comparison of the heuristics.

The $h^{MAFF}$ is based on the theory presented in the previous section.
For each state, each of the agents computes the heuristic estimate, i.e.,
a complete set of ARPGs is built. In order to manage the distribution and
asynchronism, the ARPG building algorithm slightly differs from
Algorithm 1 so that ARPGs for several different states can be built
simultaneously. The heuristic estimate is then extracted using
Algorithm 2.

In Lazy $h^{MAFF}$, the ARPGs are built lazily, i.e., an agent builds an
ARPG for its current search state using its actions and projections
similarly as in $h^{FF}$, then the Relaxed Plan (RP) is extracted, and
only when some projected action is added to the RP is a request sent to
the owner of the original action. The owner then builds an ARPG from the
given state to the preconditions of the original action, extracts an RP
using the same procedure and sends back the computed RP. The returned RP
is then merged with the original one. This effectively forms a distributed
recursive algorithm. Such an algorithm significantly lowers the
communication load and enables the agents to search in parallel,
especially in loosely coupled problems. In addition, reachability analysis
using Algorithm 1 can be done before starting the search to improve the
estimate of applicability of the projected actions.
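
The distributed recursion can be sketched in a single process, with a direct method call standing in for the request message. Everything in this Python sketch is an assumption made for illustration: the class and function names, the two-agent fragment of the logistics example, the empty preconditions of the projections, and the convention that a projection in the merged plan stands in for its original public action when counting. The sketch also assumes that every requested goal is reachable and that the recursion terminates, which holds for this example.

```python
def build_layers(state, actions):
    """Relaxed planning graph layers up to the fixed point."""
    facts, acts = [frozenset(state)], []
    while True:
        layer = frozenset(n for n, (pre, _) in actions.items() if pre <= facts[-1])
        nxt = facts[-1] | frozenset(p for n in layer for p in actions[n][1])
        if nxt == facts[-1] and acts and layer == acts[-1]:
            return facts, acts
        acts.append(layer)
        facts.append(nxt)


def extract(facts, acts, actions, goal):
    """Backward selection of first achievers (goal assumed reachable)."""
    needed, plan = set(goal), set()
    for i in range(len(acts) - 1, -1, -1):
        new_needed = set()
        for p in sorted(needed):
            if p not in facts[i]:
                a = min(n for n in acts[i] if p in actions[n][1])
                plan.add(a)
                new_needed |= actions[a][0]
                needed.discard(p)
        needed |= new_needed
    return plan


class Agent:
    def __init__(self, own, projections):
        self.own = own                  # name -> (pre, add)
        self.projections = projections  # name -> (pre, add, owner, original)

    def relaxed_plan(self, state, goal, agents):
        view = dict(self.own)
        view.update({n: (pre, add) for n, (pre, add, _, _) in self.projections.items()})
        facts, acts = build_layers(state, view)
        plan = extract(facts, acts, view, goal)
        # Lazy step: ask the owner only for projections that were selected.
        for name in [n for n in sorted(plan) if n in self.projections]:
            _, _, owner, original = self.projections[name]
            sub_goal = agents[owner].own[original][0]
            plan |= agents[owner].relaxed_plan(state, sub_goal, agents)
        return plan


AGENTS = {
    "t1": Agent(
        own={
            "load-t1-A": ({"at-p-A", "at-t1-A"}, {"in-p-t1"}),
            "drive-t1-A-B": ({"at-t1-A"}, {"at-t1-B"}),
            "unload-t1-B": ({"in-p-t1", "at-t1-B"}, {"at-p-B"}),
        },
        projections={"unload-a-C(a)": (set(), {"at-p-C"}, "a", "unload-a-C")},
    ),
    "a": Agent(
        own={
            "load-a-B": ({"at-p-B", "at-a-B"}, {"in-p-a"}),
            "fly-a-B-C": ({"at-a-B"}, {"at-a-C"}),
            "unload-a-C": ({"in-p-a", "at-a-C"}, {"at-p-C"}),
        },
        projections={"unload-t1-B(t1)": (set(), {"at-p-B"}, "t1", "unload-t1-B")},
    ),
}
STATE = {"at-p-A", "at-t1-A", "at-a-B"}
plan = AGENTS["a"].relaxed_plan(STATE, {"at-p-C"}, AGENTS)
```

In this example the airplane's local plan selects the projection unload-t1-B(t1), so a single request goes to truck t1, which returns its two internal actions. No request is sent at all when the local plan uses no projection, which is where the saving in communication comes from.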

The results in Table 1 show that $h^{FF}$ is fast and can effectively
solve smaller problem instances, but it is not well informed, as
illustrated by the number of visited states. This becomes critical in
larger problems. On the other hand, $h^{MAFF}$ is better informed, but it
has to build all ARPGs for each state which is estimated. This is
extremely communication intensive, as shown by the number of exchanged
ARPG messages ($c_r$). Also, the possibilities of parallel computation are
reduced by the fact that all agents have to build the ARPGs for each
estimated state.

The best performance is given by the Lazy $h^{MAFF}$. This implementation
keeps the heuristic estimate quality of $h^{MAFF}$, but since it does not
build ARPGs during the search, only local RPGs enriched by the projections
of other agents' actions, the RPGs can be built lazily only for those
states where some interaction between the agents is needed, and only the
agents involved build the RPGs. The ARPGs are computed only in an initial
reachability analysis, which can be omitted, but which significantly
improves the results.

Final  Remarks 

Our formal treatment and design of algorithms for computing a distributed
Relaxed Planning Graph for multiagent planning can be seen as a first step
towards efficient MA planners based on satisficing state-space search
techniques utilizing relaxation heuristics. Furthermore, we have
experimentally shown that an appropriate implementation of a decentralized
estimation of a global relaxation heuristic can radically improve the
computational and communication efficiency of the planning process as a
whole.
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